The Complete Guide to Resampling Methods and Regularization in Python

Understand how resampling methods and regularization can improve your models and apply these methods in a project setting.

Marco Peixeiro, Jul 2

[Figure: Please, use resampling methods]

Resampling and regularization are two important steps that can significantly improve both your model’s performance and your confidence in your model.

In this article, cross-validation will be extensively addressed as it is the most popular resampling method.

Then, ridge regression and lasso will be introduced as regularization methods for linear models.

Afterwards, resampling and regularization will be applied in a project setting.

I hope this article will serve as a reference for one of your future projects, and that it finds its way into your bookmarks.

Let’s get started!

The Importance of Resampling

Resampling methods are an indispensable tool in modern statistics.

They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

This gives us information that would not be available from fitting the model only once.

Usually, the objective of a data science project is to create a model using training data, and have it make predictions on new data.

Hence, the resampling methods allow us to see how the model would perform on data it has not been trained on, without collecting new data.

Cross-validation

Cross-validation (CV) is used to estimate the test error associated with a model to evaluate its performance or to select the appropriate level of flexibility.

Evaluating a model’s performance is usually defined as model assessment, and model selection is used for selecting the level of flexibility.

This terminology is widely used in the field of data science.

Now, there are different ways to perform cross-validation.

Let’s explore each one of them.

Validation set approach

This is the most basic approach.

It simply involves randomly dividing the dataset into two parts: a training set and a validation set or hold-out set.

The model is fit on the training set and the fitted model is used to make predictions on the validation set.

[Figure: Validation set schematic]

Above is a schematic of the validation set approach.

A dataset of n observations is randomly split into two parts.

The blue side represents the training set, and the orange side is the validation set.

The numbers simply represent the rows.

Of course, with such a simple approach, there are some drawbacks.

First, the validation test error rate is highly variable depending on which observations are in the training and validation set.

Second, only a subset of the observations is used to fit the model.

Since statistical methods tend to perform worse when trained on less data, the validation error may overestimate the test error of a model fit on the full dataset.

[Figure: MSE for the validation set approach]

Above, on the left, you see the MSE when the validation set approach was applied only once.

On the right, the process was repeated 10 times.

As you can see, the MSE varies greatly from one split to the next.

This shows the significant variability of the MSE when the validation set approach is used.
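To make this variability concrete, here is a minimal sketch that repeats the validation set approach ten times on synthetic data. The data, the 50/50 split, and the helper name `validation_mse` are assumptions made for illustration, not part of the original project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a noisy linear relationship, 100 observations.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=100)


def validation_mse(seed):
    """Fit on a random half of the data and return the MSE on the other half."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_val, model.predict(X_val))


# Repeat the random split 10 times: each split gives a different estimate.
mses = [validation_mse(seed) for seed in range(10)]
print([round(m, 2) for m in mses])
print("spread (max - min):", round(max(mses) - min(mses), 2))
```

The only thing that changes between runs is which rows land in the training and validation sets, yet the estimated error moves with it.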

Of course, there are methods that address these drawbacks.

Leave-one-out cross-validation

Leave-one-out cross-validation (LOOCV) is a better option than the validation set approach.

Instead of splitting the dataset into two subsets, only one observation is used for validation and the rest is used to fit the model.

[Figure: LOOCV schematic]

Above is a schematic of LOOCV.

As you can see, only one observation is used for validation and the rest is used for training.

The process is then repeated n times, so that each observation is used for validation exactly once.

After the n runs, the error is estimated as:

$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i$$

which is simply the mean of the errors of each run.
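The averaging described above can be sketched with scikit-learn's `LeaveOneOut` splitter. The synthetic data is an assumption for illustration; the point is only that there is one fit per observation and that the estimate is the mean of the per-run errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data (assumption): 30 observations of a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=30)

# One fit per observation: n runs, each validated on a single held-out row.
scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)

# scikit-learn returns the negated MSE, so flip the sign before averaging.
loocv_mse = -scores.mean()
print(len(scores), round(loocv_mse, 3))
```

Running this twice gives the same number, since LOOCV involves no random split.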

This method is much better, because it has far less bias, since more observations are used to fit the model.

There is no randomness in the training/validation set splits.

Therefore, we reduce the variability of the MSE, as shown below.

[Figure: MSE of LOOCV]

k-fold cross-validation

This approach involves randomly dividing the set of observations into k groups or folds of approximately equal size.

The first fold is treated as a validation set and the model is fit on the remaining folds.

The procedure is then repeated k times, where a different group is treated as the validation set.

[Figure: k-fold cross-validation schematic]

Hence, LOOCV is a special case of k-fold cross-validation where k is equal to the total number of observations n.

However, it is common to set k equal to 5 or 10.

LOOCV is computationally intensive for large datasets because the model must be fit n times, whereas k-fold requires only k fits and remains practical with any model.

In addition, it often gives more accurate estimates of test error than does LOOCV.

Therefore, to assess and validate your model, the k-fold cross-validation approach is usually the best option.
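The k-fold procedure can be written out by hand to see what happens in each fold. This is a sketch on synthetic data (an assumption), using scikit-learn's `KFold` only to generate the index splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (assumption): 100 observations, two predictors.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=2.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mses = []
for train_idx, val_idx in kf.split(X):
    # Fit on k-1 folds, validate on the fold that was held out.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])
    fold_mses.append(mean_squared_error(y[val_idx], predictions))

# The cross-validation estimate is the mean of the k fold errors.
cv_mse = np.mean(fold_mses)
print([round(m, 2) for m in fold_mses], round(cv_mse, 3))
```

Every observation is used for validation exactly once and for training k-1 times, which is what separates this from a single random split.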

Now that we know how cross-validation works and how it can improve our confidence in the model’s performance, let’s see how we can improve the model itself with regularization.

Regularization

Regularization methods help prevent overfitting.

Overfitting occurs when a model performs well on the training set, but then performs poorly on the validation set.

Linear regression typically uses the least squares method to estimate its parameters (logistic regression, by contrast, is usually fit by maximum likelihood).

Now, we explore how we can improve linear models by replacing least squares fitting with other fitting procedures.

These methods can yield better prediction accuracy and model interpretability.

But why?

Why use other fitting methods?

Least squares fitting works most of the time, but there are situations where it will fail.

For example, if your number of observations n is much greater than the number of predictors p, then the least squares estimates will have low variance and perform well.

On the other hand, when p is greater than n (more predictors than observations), there is no unique least squares solution: the variance is infinite and the method cannot be used!

Also, multiple linear regression tends to include variables that are not actually associated with the response.

This adds unnecessary complexity to the model.
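To make the p greater than n failure mentioned above concrete, here is a small NumPy sketch on random data (an assumption). It shows that the matrix least squares needs to invert is rank-deficient, and that a least squares solver can still fit pure noise perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 10  # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # the response is pure noise

# X^T X is p x p but has rank at most n, so it cannot be inverted.
gram = X.T @ X
print(gram.shape, np.linalg.matrix_rank(gram))

# lstsq still returns one of infinitely many solutions (the minimum-norm one),
# and it reproduces the training responses exactly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ beta, y))
```

A perfect fit to noise is the extreme case of overfitting, which is why some form of constraint on the coefficients is needed here.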

It would be good if there was a way to automatically perform feature selection, so as to include only the most relevant variables.

To achieve that, we introduce ridge regression and lasso.

These are two common regularization methods, also called shrinkage methods.

Shrinkage methods

Shrinking the estimated coefficients towards 0 can significantly improve the fit and reduce the variance of the coefficients.

Here, we explore ridge regression and lasso.

Ridge regression

Traditional linear fitting involves minimizing the RSS (residual sum of squares).

In ridge regression, a penalty term is added, and the parameters now minimize:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

where lambda is a tuning parameter.

This parameter is found using cross-validation as it must minimize the test error.

Therefore, a range of lambdas is used to fit the model and the lambda that minimizes the test error is the optimal value.

Here, ridge regression will include all p predictors in the model.

Hence, it is a good method to improve the fit of the model, but it will not perform variable selection.
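A quick way to see this shrinkage is to fit ridge regression with increasing penalties and watch the coefficients. This sketch uses synthetic data (an assumption) in which the third predictor has no effect on the response; note that scikit-learn calls the tuning parameter `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption): the third predictor has a true coefficient of 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, 2.0, 0.0]) + rng.normal(scale=1.0, size=50)

# Fit the same model with a small, a moderate, and a very large penalty.
coefs_by_alpha = {}
for alpha in [0.01, 10, 1000]:
    coefs_by_alpha[alpha] = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs_by_alpha[alpha], 3))
```

The coefficients get smaller as the penalty grows, but all three predictors stay in the model.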

Lasso

Similarly to ridge regression, lasso minimizes:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

Notice that we use the absolute value of the parameter beta instead of its squared value.

Also, the same tuning parameter is present.

However, if lambda is large enough, some coefficients will be set exactly to 0!

Therefore, lasso can also perform variable selection, making the model much easier to interpret.
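The variable selection effect can be observed directly by counting zero coefficients as the penalty grows. This is a sketch on synthetic data (an assumption); scikit-learn's `alpha` plays the role of lambda, although its objective scales the RSS term by 1/(2n), so the numeric values are not directly comparable to the formula.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): the third predictor has a true coefficient of 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, 2.0, 0.0]) + rng.normal(scale=1.0, size=50)

# Fit with a small, a moderate, and a very large penalty.
coefs_by_alpha = {}
for alpha in [0.01, 0.5, 10]:
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    coefs_by_alpha[alpha] = coefs
    n_zero = int(np.sum(coefs == 0))
    print(alpha, np.round(coefs, 3), "zero coefficients:", n_zero)
```

With the largest penalty every coefficient is dropped, which is the limiting case of the selection behaviour.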

Project

Great! We know how regularization and resampling work.

Now, let’s apply these techniques in a project setting.

Fire up a Jupyter notebook and grab the dataset.

If you ever get stuck, the solution notebook is also available.

Let’s get to it!

Import libraries

Like with any project, we import our usual libraries that will help us perform basic data manipulation and plotting.
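The original import cell is not reproduced here. A typical set for this kind of project, given as an assumption rather than the notebook's exact imports, would be:

```python
import numpy as np               # numerical arrays
import pandas as pd              # tabular data manipulation
import matplotlib.pyplot as plt  # plotting
```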

Now, we can start our exploratory data analysis.

Exploratory data analysis

We start off by importing our dataset and looking at the first five rows.

In the output, notice that the Unnamed: 0 column is useless.

Let’s take it out.
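Loading the data and dropping that column might look like the sketch below. Because the dataset file is not included here, the example reads a tiny in-memory CSV with made-up values; the column names `TV`, `radio`, `newspaper` and `sales` are assumptions based on the description in this article.

```python
import io

import pandas as pd

# Stand-in for the real file (assumption): made-up values with the same kind
# of leftover index column. With the real dataset, pass its path to
# pd.read_csv instead of this in-memory buffer.
csv_text = """,TV,radio,newspaper,sales
1,100.0,20.0,30.0,12.0
2,50.0,40.0,10.0,9.5
3,200.0,5.0,60.0,15.0
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.head())  # the unnamed first column shows up as "Unnamed: 0"

# Drop the leftover index column; the advertising columns and sales remain.
data = data.drop(columns=["Unnamed: 0"])
print(list(data.columns))
```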

Now our dataset contains only three advertising mediums as features, and sales is our target variable.

Let’s see how each variable impacts the sales by making a scatter plot.

First, we build a helper function to make a scatter plot. Then, we can generate one plot for each of the three features.

[Figure: Sales with respect to money spent on TV ads]

[Figure: Sales with respect to money spent on radio ads]

[Figure: Sales with respect to money spent on newspaper ads]

As you can see, TV and radio ads seem to be good predictors for sales, while there seems to be no correlation between sales and newspaper ads.
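A scatter plot helper along the lines described above could be sketched as follows. The function name `scatter_plot`, its parameters, and the sample numbers are hypothetical; the original helper is not reproduced here.

```python
import matplotlib.pyplot as plt


def scatter_plot(feature_values, target_values, feature_name, target_name="sales"):
    """Scatter one feature against the target and return the Axes object."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.scatter(feature_values, target_values, color="black")
    ax.set_xlabel(f"Money spent on {feature_name} ads")
    ax.set_ylabel(target_name)
    return ax


# Made-up numbers (assumption), only to show how the helper is called.
ax = scatter_plot([100.0, 50.0, 200.0], [12.0, 9.5, 15.0], "TV")
```

In a notebook the figure is displayed automatically; in a script, call `plt.show()` afterwards. Calling the helper once per feature column gives the three plots.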

Luckily, our dataset does not require further processing, so we are ready to move on to modelling right away!

Modelling

Multiple linear regression: least squares fitting

Let’s walk through the code step by step.

First, we import the LinearRegression and cross_val_score objects.

The first one will allow us to fit a linear model, while the second object will perform k-fold cross-validation.

Then, we define our features and target variable.

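cross_val_score returns an array of scores, one for each cross-validation fold. Here the score is the negative MSE, because scikit-learn scorers follow the convention that higher is better.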

In our case, we have five of them.

Therefore, we take the mean of MSE and print it.

You should get a negative MSE of -3.0729.
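The steps described above can be sketched as follows. To keep the example self-contained it runs on synthetic data (an assumption), so it will not reproduce the score quoted in this article; with the real dataset, the features would be the three advertising columns and the target would be sales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the advertising data (assumption): three features,
# of which only the first two influence the target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=1.5, size=200)

lin_reg = LinearRegression()

# 5-fold cross-validation; each entry is the negated MSE of one fold.
mse_scores = cross_val_score(
    lin_reg, X, y, scoring="neg_mean_squared_error", cv=5
)

mean_mse = np.mean(mse_scores)
print(mean_mse)
```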

Now, let’s see if ridge regression or lasso will be better.

Ridge regression

For ridge regression, we introduce GridSearchCV.

This will allow us to automatically perform 5-fold cross-validation with a range of different regularization parameters in order to find the optimal value of alpha.

After fitting the grid search, we can read the best parameter and the best MSE from its best_params_ and best_score_ attributes.

You should see that the optimal value of alpha is 20, with a negative MSE of -3.07267.

This is a slight improvement upon the basic multiple linear regression.
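A grid search of this kind might look like the sketch below. It uses synthetic data and a hypothetical list of candidate alphas (both assumptions; this article only states that the best value found was 20), so the selected alpha and score will differ from the ones quoted above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the advertising data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=1.5, size=200)

# Candidate regularization strengths (hypothetical grid).
parameters = {"alpha": [1e-3, 1e-2, 1e-1, 1, 5, 10, 20]}

# For every alpha, run 5-fold cross-validation and record the mean score.
ridge_search = GridSearchCV(
    Ridge(), parameters, scoring="neg_mean_squared_error", cv=5
)
ridge_search.fit(X, y)

print(ridge_search.best_params_)  # alpha with the highest (least negative) score
print(ridge_search.best_score_)   # mean negated MSE for that alpha
```

The same pattern works for any estimator that exposes an `alpha` parameter: only the estimator passed to `GridSearchCV` changes.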

Lasso

For lasso, we follow a very similar process to ridge regression.

In this case, the optimal value for alpha is 1, and the negative MSE is -3.0414, which is the best score of all three models!

That’s it! You now understand how resampling and regularization can greatly improve your model, and you know how to implement each in a project setting.

I hope you found this article useful and that you refer back to it.

Cheers!