Intuitions on L1 and L2 Regularisation

Here’s a primer on norms:

- 1-norm (also known as L1 norm): ‖w‖₁ = |w₁| + … + |wₙ|
- 2-norm (also known as L2 norm or Euclidean norm): ‖w‖₂ = √(w₁² + … + wₙ²)
- p-norm: ‖w‖ₚ = (|w₁|ᵖ + … + |wₙ|ᵖ)^(1/p)

A linear regression model that implements the L1 norm for regularisation is called lasso regression, and one that implements the L2 norm for regularisation is called ridge regression. To implement these two, note that the linear regression model stays the same:

ŷ = wx + b

but it is the calculation of the loss function that includes these regularisation terms:

Loss function with no regularisation: L = (ŷ − y)²
Loss function with L1 regularisation: L1 = (ŷ − y)² + λ|w|
Loss function with L2 regularisation: L2 = (ŷ − y)² + λw²

The regularisation terms are ‘constraints’ that an optimisation algorithm must ‘adhere to’ when minimising the loss function, on top of having to minimise the error between the true y and the predicted ŷ.

1) Model

For simplicity, we define a simple linear regression model ŷ with one independent variable:

ŷ = wx + b

Here I have used the deep learning conventions w (‘weight’) and b (‘bias’).

In practice, simple linear regression models are not prone to overfitting. As mentioned in the introduction, deep learning models are more susceptible to such problems due to their model complexity. As such, do note that the expressions used in this article extend easily to more complex models and are not limited to linear regression.

2.0) Loss function with no regularisation

We then define the loss function as the squared error, where the error is the difference between y (the true value) and ŷ (the predicted value):

L = (ŷ − y)²

Let’s assume our model will be overfitted using this loss function.

2.1) Loss function with L1 regularisation

Based on our loss function, adding an L1 regularisation term to L looks like this:

L1 = (ŷ − y)² + λ|w|

where the regularisation parameter λ > 0 is manually tuned. Note that |w| is differentiable everywhere except at w = 0 (the original article illustrates this with a plot of |w|, omitted here).
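To make the three loss functions concrete, here is a minimal Python sketch for the single-variable model above. The function names (predict, loss, loss_l1, loss_l2) are my own, not from the article:

```python
def predict(w, b, x):
    # Simple linear model: ŷ = wx + b
    return w * x + b

def loss(w, b, x, y):
    # Squared error, no regularisation: L = (ŷ − y)²
    return (predict(w, b, x) - y) ** 2

def loss_l1(w, b, x, y, lam):
    # L1 = (ŷ − y)² + λ|w|  (lasso-style penalty)
    return loss(w, b, x, y) + lam * abs(w)

def loss_l2(w, b, x, y, lam):
    # L2 = (ŷ − y)² + λw²  (ridge-style penalty)
    return loss(w, b, x, y) + lam * w ** 2
```

Note that for a perfectly fitting weight the squared-error term vanishes, but the penalty terms do not: the regularised losses still charge a cost proportional to the size of w.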
We will need this fact later.

2.2) Loss function with L2 regularisation

Adding an L2 regularisation term to L looks like this:

L2 = (ŷ − y)² + λw²

where, again, λ > 0.

3) Gradient descent

Now, let’s use gradient descent optimisation to find w. Evaluating the gradient of L, L1 and L2 w.r.t. w gives us:

L:  dL/dw = 2x(wx + b − y)
L1: dL1/dw = 2x(wx + b − y) + λ(d|w|/dw)
L2: dL2/dw = 2x(wx + b − y) + 2λw

where d|w|/dw is sign(w) for w ≠ 0 — this is why the non-differentiability of |w| at w = 0 mattered earlier.

4) How is overfitting prevented?

Let’s perform the following substitution on the equations above: set η = 1 and write H = 2x(wx + b − y). Thus we have:

L:  w ← w − H                 (Eqn. 0)
L1: w ← w − H − λ·sign(w)     (Eqn. 1.1)
L2: w ← w − H − 2λw           (Eqn. 1.2)

Observe the differences between the weight updates with and without the regularisation parameter λ.

4.1) L vs. {L1 and L2}

Intuition A: Let’s say that with Eqn. 0, executing w − H gives us a w value that leads to overfitting. Then, intuitively, Eqns. 1.1–1.2 will reduce the chances of overfitting, because introducing λ shifts us away from the very w that was going to cause the overfitting problems in the previous sentence.

Intuition B: Let’s say an overfitted model means that we have a w value that is perfect for our model. ‘Perfect’ meaning that if we substituted the data (x) back into the model, our prediction ŷ would be very, very close to the true y. Sure, that sounds good, but we don’t want perfect. Why? Because this means our model is only meant for the dataset it was trained on; it will produce predictions that are far off from the true values for other datasets. So we settle for less than perfect, with the hope that our model can also make close predictions on other data. To do this, we ‘taint’ this perfect w in Eqns. 1.1–1.2 with a penalty term λ.
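The update rules in Eqns. 0, 1.1 and 1.2 can be sketched as a single Python step function. This is my own illustration of the equations above (the helper names sign and update are assumptions, not from the article); it uses the subgradient convention sign(0) = 0 at the non-differentiable point:

```python
def sign(w):
    # Subgradient of |w|: ±1 away from zero, taken as 0 at w = 0
    return (w > 0) - (w < 0)

def update(w, b, x, y, lam=0.0, reg=None, eta=1.0):
    # One gradient-descent step on w. h starts as H = 2x(wx + b − y).
    h = 2 * x * (w * x + b - y)
    if reg == "l1":
        h += lam * sign(w)   # Eqn. 1.1: w ← w − H − λ·sign(w) when η = 1
    elif reg == "l2":
        h += 2 * lam * w     # Eqn. 1.2: w ← w − H − 2λw when η = 1
    return w - eta * h       # Eqn. 0 when no penalty is added
```

With λ = 0 this reduces exactly to the unregularised update; any λ > 0 nudges the new w away from the value that plain gradient descent would have chosen.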
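Intuition B can also be checked numerically: a small sketch of my own (not from the article) that fits w by repeated gradient steps on one data point, with b fixed at 0, and compares the final weight with and without the L2 penalty. The unregularised run converges to the ‘perfect’ w, while the ridge run is pulled towards zero:

```python
def fit(x, y, lam=0.0, reg=None, eta=0.01, steps=1000):
    # Repeated gradient-descent updates of w on a single (x, y) pair.
    w = 0.0
    for _ in range(steps):
        h = 2 * x * (w * x - y)  # gradient of the squared error (b = 0)
        if reg == "l2":
            h += 2 * lam * w     # L2 penalty gradient (Eqn. 1.2)
        w -= eta * h
    return w

w_plain = fit(x=1.0, y=3.0)                      # converges towards w = 3
w_ridge = fit(x=1.0, y=3.0, lam=1.0, reg="l2")   # shrunk towards 0
```

Here the ridge solution settles where the two gradient terms balance, i.e. at w = λ-dependent value strictly between 0 and the ‘perfect’ w = 3, which is exactly the ‘tainting’ described above.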
