Lasso and Ridge Regularization – A Rescuer From Overfitting

Rashmi Manwani 20 Sep, 2021


This article was published as a part of the Data Science Blogathon

Introduction

OVERFITTING! Hardly a day goes by without running into it, and then we try one option after another to get acceptable accuracy on the test dataset. But what if I tell you there is a technique that penalizes the model when it drifts towards overfitting? Yes, you heard that right. We have some saviours that rescue our model from overfitting. Before moving on to our rescuers, let us first understand overfitting with a real-world scenario:

Fig 1. Relocation from the hot region to the cold region

Suppose you have lived in a hot region all your life until graduation, and now, for some reason, you have to move to a colder one. As soon as you move, you feel under the weather because your body needs time to adapt to the new climate. Being so well adapted to one environment that you cannot readily adjust to a new one is what we call overfitting.

In technical terms, overfitting arises when we train a model so heavily on the training dataset that it starts fitting noise and irrelevant features. Such a model achieves high accuracy on the training set but fails to generalize to the test set.

Why can't we accept an Overfitted Model?

An overfitted model does not handle unseen data well and will fail badly when given new inputs. In terms of our previous example, if your body is adapted to only one geographical area with a specific climate, it cannot adapt to a new climate instantly.

Graphically, we can recognize overfitting by looking at the accuracy and loss during training and validation.

Fig 2. Training and Validation Accuracy


Fig 3. Training and Validation Loss

Note that the training accuracy (in blue) reaches 100%, while the validation accuracy (in orange) only reaches 70%. Training loss falls to 0, while the validation loss attains its minimum value just after the 2nd epoch. Training beyond that point pushes the model to rely on noisy and irrelevant features for prediction, and so the validation loss increases.
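The same reading of the loss curves can be done programmatically. Here is a minimal sketch, assuming the per-epoch losses are already available as plain lists. The helper names `best_epoch` and `overfitting_gap` are hypothetical, and the numbers are made up for illustration; they are not the values behind Fig 2 and Fig 3.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


def overfitting_gap(train_losses, val_losses):
    """Per-epoch gap between validation loss and training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]


# Hypothetical loss histories (illustrative values only).
train_losses = [0.90, 0.55, 0.30, 0.15, 0.05, 0.01]
val_losses = [0.95, 0.70, 0.62, 0.75, 0.90, 1.10]

stop_at = best_epoch(val_losses)
gaps = overfitting_gap(train_losses, val_losses)
print("Validation loss is lowest at epoch", stop_at)
print("Gap at the last epoch:", round(gaps[-1], 2))
```

A gap that keeps widening after the best epoch is the numerical counterpart of the diverging curves in the plots.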

To get more insights about overfitting, it is fundamental to understand the role of variance and bias in overfitting:

What is Variance?

Variance tells us about the spread of the predictions. It measures how much a model's predictions change when it is trained on different samples of the data, that is, how far individual predictions scatter around their own mean.

What is Bias?

It is the difference between the average prediction and the target value.

The relationship of bias and variance with overfitting and underfitting is as shown below:


Fig 4. Bias and Variance w.r.t Overfitting and Underfitting

Low bias and low variance give a balanced model, whereas high bias leads to underfitting and high variance leads to overfitting.


Fig 5. Bias Vs Variance

Low Bias: The average prediction is very close to the target value

High Bias: The predictions differ too much from the actual value

Low Variance: The predictions are compact and do not vary much from their mean value

High Variance: The predictions are scattered, with large variations from their mean value and from one another.

To make a good fit, we need to have a correct balance of bias and variance.
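To make the trade-off concrete, here is a rough simulation sketch. It assumes a known target function (a sine curve, chosen only for illustration), refits a polynomial on many noisy copies of the same inputs, and then estimates squared bias and variance from the collected predictions. The function names and all settings are assumptions for this example, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_function(x):
    return np.sin(2 * np.pi * x)


def bias_and_variance(degree, n_repeats=200, n_train=20, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit."""
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.1, 0.9, 25)
    predictions = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Same inputs every time, but a fresh draw of the noise.
        y_train = true_function(x_train) + rng.normal(0.0, noise, n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        predictions[r] = np.polyval(coeffs, x_test)
    mean_prediction = predictions.mean(axis=0)
    bias_sq = np.mean((mean_prediction - true_function(x_test)) ** 2)
    variance = np.mean(predictions.var(axis=0))
    return bias_sq, variance


simple_bias, simple_var = bias_and_variance(degree=1)
flexible_bias, flexible_var = bias_and_variance(degree=9)
print(f"degree 1: bias^2={simple_bias:.3f}, variance={simple_var:.3f}")
print(f"degree 9: bias^2={flexible_bias:.3f}, variance={flexible_var:.3f}")
```

The straight line should show the larger bias (underfitting), and the degree-9 polynomial the larger variance (overfitting).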

What is Regularization?

Regularization is one of the ways to make our model work better on unseen data by down-weighting the less important features.
Regularization aims to reduce the validation loss and thereby improve the accuracy of the model on new data.
It counters overfitting by adding a penalty on large coefficients to the cost function of a high-variance model, thereby shrinking the beta coefficients towards zero.


Fig 6. Regularization and its types

Two common types of regularization are:

Lasso Regularization
Ridge Regularization

What is Lasso Regularization (L1)?

It stands for Least Absolute Shrinkage and Selection Operator
It adds L1 as the penalty
L1 is the sum of the absolute values of the beta coefficients

Cost function = Loss + λ Σ |w|
Here,
Loss = sum of squared residuals
λ = penalty strength
w = coefficients (slopes) of the model
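As a worked example of the L1-penalized cost (squared-residual loss plus λ times the sum of absolute coefficients), here is a small sketch in plain Python. The function name `lasso_cost` and the tiny dataset are made up for illustration, and the model has no intercept to keep the arithmetic easy to follow.

```python
def lasso_cost(X, y, w, lam):
    """Sum of squared residuals plus lam times the sum of |w_j|."""
    residuals = [
        target - sum(w_j * x_j for w_j, x_j in zip(w, row))
        for row, target in zip(X, y)
    ]
    loss = sum(r ** 2 for r in residuals)
    l1_penalty = sum(abs(w_j) for w_j in w)
    return loss + lam * l1_penalty


# Tiny made-up example: two samples, two features, no intercept.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
w = [0.5, -0.25]

print(lasso_cost(X, y, w, lam=0.0))  # 3.25 (loss only)
print(lasso_cost(X, y, w, lam=2.0))  # 4.75 (3.25 + 2 * 0.75)
```

The residuals are 1.0 and 1.5, so the loss is 1.0 + 2.25 = 3.25, and the penalty adds 2 × (0.5 + 0.25) = 1.5.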

What is Ridge Regularization (L2)?

It adds L2 as the penalty
L2 is the sum of the squares of the beta coefficients

Cost function = Loss + λ Σ w²
Here,
Loss = sum of squared residuals
λ = penalty strength
w = coefficients (slopes) of the model

λ controls the strength of the penalty. As λ increases, large coefficients add more to the cost function, so minimizing it drives the coefficients towards smaller values. This is the shrinkage.
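The shrinkage effect of λ can be seen directly with Ridge, which has a closed-form solution when there is no intercept: w = (XᵀX + λI)⁻¹ Xᵀy. The sketch below uses NumPy on synthetic data generated only for this illustration; `ridge_coefficients` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np


def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^-1 X^T y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Synthetic data, generated only for this illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_coefficients(X, y, lam)
    print(f"lambda={lam:>5}: w={np.round(w, 3)}, ||w||={np.linalg.norm(w):.3f}")
```

With λ = 0 this is ordinary least squares. As λ grows, the length of the coefficient vector shrinks, but the individual coefficients do not become exactly zero.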

Now it's time to dive into some code:

To compare Linear, Ridge, and Lasso Regression, I will be using a real estate dataset in which we have to predict the house price per unit area.

The dataset looks like this:

Fig 7. Real Estate Dataset

Dividing the dataset into train and test sets:

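from sklearn.model_selection import train_test_split

# Features: everything except the target, the transaction date, and the house age
X = df.drop(columns=['Y house price of unit area', 'X1 transaction date', 'X2 house age'])
Y = df['Y house price of unit area']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)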

Fitting the model on Linear Regression:

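from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Ordinary least squares, with no penalty on the coefficients
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
lin_reg_y_pred = lin_reg.predict(x_test)
mse = mean_squared_error(y_test, lin_reg_y_pred)
print(mse)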

The Mean Square Error for Linear Regression is: 63.90493104709001

The coefficients of the columns for the Linear Regression model are:


Fig 8. Beta Coefficients for Linear Regression

Fitting the model on Lasso Regression:

from sklearn.linear_model import Lasso

# Lasso with scikit-learn's default penalty strength (alpha=1.0)
lasso = Lasso()
lasso.fit(x_train, y_train)
y_pred_lasso = lasso.predict(x_test)
mse = mean_squared_error(y_test, y_pred_lasso)
print(mse)

The Mean Square Error for Lasso Regression is: 67.04829587817319

The coefficients of the columns for the Lasso Regression model are:

Fig 9. Beta Coefficients for Lasso Regression

Fitting the model on Ridge Regression:

from sklearn.linear_model import Ridge

# Ridge with scikit-learn's default penalty strength (alpha=1.0)
ridge = Ridge()
ridge.fit(x_train, y_train)
y_pred_ridge = ridge.predict(x_test)
mse = mean_squared_error(y_test, y_pred_ridge)
print(mse)

The Mean Square Error for Ridge Regression is: 66.07258621837418

The coefficients of the columns for the Ridge Regression model are:


Fig 10. Beta Coefficients for Ridge Regression
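Both models above use the default penalty strength, which in scikit-learn is the `alpha` argument of `Lasso` and `Ridge`. One common way to pick a value is to compare candidates on held-out data. The sketch below illustrates the idea with NumPy only, using a closed-form ridge fit on synthetic data. The helper names, the candidate grid, and the data are all assumptions for this example and are unrelated to the real estate dataset.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return float(np.mean((y_true - y_pred) ** 2))


# Synthetic data for illustration: 20 features, only 3 of which matter.
rng = np.random.default_rng(7)
true_w = np.zeros(20)
true_w[:3] = [4.0, -3.0, 2.0]
X = rng.normal(size=(60, 20))
y = X @ true_w + rng.normal(scale=2.0, size=60)

# Hold out half of the rows to validate each candidate penalty.
X_train, X_val = X[:30], X[30:]
y_train, y_val = y[:30], y[30:]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = {lam: mse(y_val, X_val @ ridge_fit(X_train, y_train, lam))
              for lam in candidates}
best_lam = min(val_errors, key=val_errors.get)

for lam, err in val_errors.items():
    print(f"lambda={lam:>5}: validation MSE={err:.2f}")
print("Best lambda on this split:", best_lam)
```

The winning value depends on the data and on the split, so treat it as a starting point, not a fixed answer.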

Comparing the coefficients of the Lasso and Ridge Regularization models

import numpy as np
import matplotlib.pyplot as plt

# Each array holds one feature's coefficient for the
# Linear, Lasso, and Ridge models, in that order
x = ['Linear', 'Lasso', 'Ridge']
y1 = np.array([-0.004709, -0.005994, 0.005700])
y2 = np.array([1.007691, 0.958896, 1.135925])
y3 = np.array([221.632669, 0.000000, 7.304642])
y4 = np.array([-8.841321, -0.000000, -0.915969])

# Draw on a single wide figure, stacking the four coefficients per model
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(30, 6))
axes.bar(x, y1, color='black')
axes.bar(x, y2, bottom=y1, color='b')
axes.bar(x, y3, bottom=y1+y2, color='g')
axes.bar(x, y4, bottom=y1+y2+y3, color='r')

axes.set_xlabel("Models")
axes.set_ylabel("Coefficients")
axes.legend(["X3", "X4", "X5", "X6"])
axes.set_title("Comparing coefficients of different models")
plt.show()


Fig 11. Comparison of Beta Coefficients

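Inspecting the coefficients, we can see that Lasso and Ridge Regression have shrunk the coefficients towards zero, and Lasso has set the X5 and X6 coefficients to exactly zero. In contrast, Linear Regression still has a large coefficient for the X5 column. Note that with the default penalty strength, the test MSE of Lasso (67.05) and Ridge (66.07) is slightly higher than that of Linear Regression (63.90), so shrinkage did not lower the test error in this run.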
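Why does Lasso produce exact zeros while Ridge only produces small values? A simplified one-coefficient view helps. Assume a single coefficient whose unpenalized estimate is z, with the feature scaled so that the loss reduces to (z - w)². Minimizing (z - w)² + λ|w| gives soft-thresholding at λ/2, while minimizing (z - w)² + λw² gives a plain rescaling. The sketch below is only this simplified setting with hypothetical helper names; it is not the solver scikit-learn runs.

```python
def lasso_shrink(z, lam):
    """Minimizer of (z - w)**2 + lam * |w|: soft-thresholding at lam / 2."""
    magnitude = max(abs(z) - lam / 2.0, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude


def ridge_shrink(z, lam):
    """Minimizer of (z - w)**2 + lam * w**2: rescaling by 1 / (1 + lam)."""
    return z / (1.0 + lam)


lam = 1.0
for z in [3.0, -1.0, 0.4]:
    print(f"unpenalized={z:>5}: lasso={lasso_shrink(z, lam):.2f}, "
          f"ridge={ridge_shrink(z, lam):.2f}")
```

Any estimate with |z| ≤ λ/2 is set to exactly zero by the L1 penalty, whereas the L2 penalty scales every estimate by the same factor and never reaches zero unless z itself is zero.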

Comparing Lasso and Ridge Regularization techniques

Fig 12. Comparison of L1 Regularization and L2 Regularization

Conclusion

We learned about two regularized regression techniques, Lasso and Ridge Regression, which can be effective against overfitting. These techniques produce a better-fitting model by adding a penalty and shrinking the beta coefficients. A correct balance of bias and variance is necessary to control overfitting.

Yayyy! You’ve made it to the end of the article and successfully gotten the hang of these topics of Bias and Variance, Overfitting and Underfitting, and Regularization techniques.😄

Happy Learning! 😊

I’d be obliged to receive any comments, suggestions, or feedback.

You can find the complete code here.

Stay tuned for upcoming blogs!

Connect on LinkedIn: https://www.linkedin.com/in/rashmi-manwani-a13157184/

Connect on Github: https://github.com/Rashmiii-00

Fig 5: http://scott.fortmann-roe.com/docs/BiasVariance.html

About the Author:

Rashmi Manwani

Passionate to learn about Machine Learning topics and their implementation. Thus, finding my way to develop a strong knowledge of the domain by writing appropriate articles on Data Science topics.

The media shown in this article on Interactive Dashboard using Bokeh are not owned by Analytics Vidhya and are used at the Author’s discretion.