Lasso And Ridge Regression

Reading Time: 4 minutes

In this blog, we will learn about the lasso and ridge regression techniques. We will compare and analyze the two methods in detail.

Introducing Linear Models

  • Linear regression is a type of linear model, and the most basic and commonly used predictive algorithm.
  • This popularity cannot be dissociated from its simple, yet effective architecture. A linear model assumes a linear relationship between input variable(s) 𝑥 and an output variable y. The equation for a linear model looks like this:

    y = b + \sum_{j=1}^{n} w_j x_j        (1.1)

  • Equation 1.1 shows a linear model with n features.
  • w_j is the coefficient (or weight) assigned to each feature x_j – an indicator of its significance to the outcome y.
  • For example, we assume that temperature is a larger driver of ice cream sales than whether it’s a public holiday.
  • The weight assigned to temperature in our linear model will be larger than the public holiday variable.

A linear model’s goal is to optimize the weights (w) and the intercept (b) via the cost function in equation 1.2. The cost function calculates the error between predictions and actual values, represented as a single real-valued number. It is the average error across the M samples in the dataset, represented below as:

    J(w, b) = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2        (1.2)

In equation 1.2, y_i is the actual value and \hat{y}_i is the value predicted by the linear model from equation 1.1, where M is the number of rows (samples) in the dataset and n is the number of features.
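
To make equations 1.1 and 1.2 concrete, here is a minimal NumPy sketch of the prediction and the cost; the toy numbers and the names X, w, b and y_hat are our own illustration, not from the original post:

    import numpy as np

    # Toy data: M = 4 samples, n = 2 features (temperature, public-holiday flag)
    X = np.array([[30.0, 0.0],
                  [25.0, 1.0],
                  [18.0, 0.0],
                  [35.0, 1.0]])
    y = np.array([220.0, 190.0, 120.0, 260.0])  # ice cream sales

    w = np.array([7.0, 10.0])  # one weight per feature (equation 1.1)
    b = 5.0                    # intercept

    y_hat = X @ w + b                 # predictions, equation 1.1
    cost = np.mean((y - y_hat) ** 2)  # mean squared error, equation 1.2
    print(cost)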

Regularization

When it comes to training models, there are two major problems one can encounter: overfitting and underfitting.

  • Overfitting happens when the model performs well on the training set but not so well on unseen (test) data.
  • Underfitting happens when it neither performs well on the train set nor on the test set.
  • In particular, regularization is applied to avoid overfitting, especially when there is a large gap between train and test set performance.
  • With regularization, the number of features used in training stays constant, yet the magnitude of the coefficients (w) seen in equation 1.1 is reduced.
  • Consider the coefficients of a linear model trained to predict house prices. While there are quite a number of predictors, RM and RAD have the largest coefficients.
  • These two features drive the predicted housing prices far more strongly than the rest, which can lead to overfitting.

There are different ways of reducing model complexity and preventing overfitting in linear models. These include the ridge and lasso regression models.

Introduction to Lasso Regression

  • Lasso regression is a regularization technique used for feature selection via a shrinkage method, also referred to as the penalized regression method.
  • Lasso is short for Least Absolute Shrinkage and Selection Operator, and it is used both for regularization and for model selection.
  • If a model uses the L1 regularization technique, it is known as lasso regression.

Lasso Regression for Regularization

  • This shrinkage technique penalizes the coefficients of the linear model from equation 1.1.
  • The coefficients are shrunk towards a central point, such as the mean, by introducing a penalization factor called alpha, α (or sometimes lambda), added to the cost function from equation 1.2:

    J_{lasso}(w, b) = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{n} |w_j|

  • Alpha (α) is the penalty term that denotes the amount of shrinkage (or constraint).
  • With alpha set to zero, the lasso cost reduces to the plain linear regression cost function from equation 1.2, while a larger value penalizes the optimization function more heavily.
  • Therefore, lasso regression shrinks the coefficients and helps to reduce the model complexity and multi-collinearity.
  • Alpha (α) can be any real-valued number between zero and infinity; the larger the value, the more aggressive the penalization (see the sketch after this list).
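
To sketch how alpha controls the shrinkage, the snippet below fits scikit-learn's Lasso at increasing alpha values; the synthetic dataset and the alpha values are our own illustrative choices:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Illustrative synthetic data: 100 samples, 10 features, 4 informative
    X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                           noise=10.0, random_state=42)

    for alpha in [0.01, 1.0, 10.0]:
        lasso = Lasso(alpha=alpha).fit(X, y)
        # Larger alpha => stronger penalty => smaller coefficients overall
        print(alpha, np.round(lasso.coef_, 2))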

Lasso Regression for Model Selection

  • Since the coefficients are shrunk towards a mean of zero, the less important features in a dataset are eliminated when penalized.
  • The shrinkage of these coefficients, based on the alpha value provided, leads to a form of automatic feature selection, as the sketch after this list illustrates.
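
A minimal sketch of this automatic selection, again on our own synthetic data (the alpha value is an illustrative choice):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                           noise=10.0, random_state=42)

    # A sufficiently large alpha zeroes out the least important
    # coefficients, effectively dropping those features.
    lasso = Lasso(alpha=10.0).fit(X, y)
    print("kept features:   ", np.flatnonzero(lasso.coef_ != 0))
    print("dropped features:", np.flatnonzero(lasso.coef_ == 0))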

Ridge Regression

  • Similar to lasso regression, ridge regression constrains the coefficients by introducing a penalty factor.
  • However, while lasso regression penalizes the magnitude (absolute value) of the coefficients, ridge regression penalizes their square.
  • Ridge regression is also referred to as L2 regularization; its cost function is sketched after this list.
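
In the notation of equations 1.1 and 1.2, the ridge cost function takes the following form (this penalized form is our own reconstruction, mirroring the lasso cost above):

    J_{ridge}(w, b) = \frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{n} w_j^2

A minimal scikit-learn sketch, again on our own illustrative synthetic data:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                           noise=10.0, random_state=42)

    # Ridge shrinks all coefficients towards zero but rarely to exactly zero.
    ridge = Ridge(alpha=10.0).fit(X, y)
    print(np.round(ridge.coef_, 2))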

Why Lasso can be Used for Model Selection, but not Ridge Regression?

  • Consider the geometry of the lasso and ridge models: the elliptical contours represent the (unpenalized) cost function, while the constrained region imposed by the penalty is a diamond for lasso and a circle for ridge.
  • Relaxing the constraint introduced by the penalty factor enlarges the constrained region (diamond, circle).
  • Doing this continually, we eventually reach the centre of the ellipse, where the results of both the lasso and ridge models match a plain linear regression model.
  • Both methods determine their coefficients by finding the first point where the elliptical contours touch the region of constraints.
  • Since lasso's constrained region is a diamond, it has corners on the axes; whenever the elliptical contours first touch the region at a corner, at least one of the coefficients becomes exactly zero.
  • This is impossible in the ridge regression model, whose constrained region is circular and has no corners; values can therefore shrink close to zero, but never exactly to zero (see the comparison sketch after this list).
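
To see this difference empirically, the sketch below (again using our own synthetic data and an illustrative alpha) fits both models and counts the coefficients that are exactly zero; we would expect lasso to produce some and ridge none:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                           noise=10.0, random_state=42)

    lasso = Lasso(alpha=10.0).fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)

    # Lasso's diamond-shaped constraint yields exact zeros; ridge's
    # circular constraint only shrinks values close to zero.
    print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
    print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))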

Conclusion

We have seen sketches of the ridge and lasso regression models and the theoretical and mathematical concepts behind these techniques. Some of the key takeaways from this blog include:

  • The cost functions for ridge and lasso regression are similar. However, ridge regression penalizes the square of the coefficients while lasso penalizes their magnitude (absolute value).
  • Lasso regression can be used for automatic feature selection, as the geometry of its constrained region allows coefficient values to shrink exactly to zero.
  • An alpha value of zero in either a ridge or lasso model gives results identical to a plain linear regression model.
  • The larger the alpha value, the more aggressive the penalization.
