Machine Learning: Regularization and Over-fitting Simply Explained

Kshitiz Sirohi, Feb 12

I am going to give an intuitive understanding of the regularization method in as simple words as possible.

First, I will discuss some basic ideas, so if you think you are already familiar with those, feel free to move ahead.

A Linear Model

A linear model is one that follows a straight line in its predictions.

It can use a single attribute or multiple attributes to predict the value, and the equation looks like this:

h(x) = theta-0 + theta-1 * X-1 + theta-2 * X-2 + ... + theta-n * X-n    (a)

Here theta-0 is the intercept and theta-1 to theta-n are the slopes corresponding to the attributes X-1 to X-n.
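To make the linear hypothesis described above concrete, here is a minimal Python sketch. The function name predict and the numbers are placeholders I picked for illustration; they do not come from any real dataset.

```python
def predict(thetas, xs):
    """Linear hypothesis: theta-0 + theta-1 * X-1 + ... + theta-n * X-n."""
    intercept, slopes = thetas[0], thetas[1:]
    return intercept + sum(t * x for t, x in zip(slopes, xs))


# Hypothetical values, chosen only for illustration.
thetas = [1.0, 2.0, 0.5]   # theta-0 (intercept), theta-1, theta-2
features = [3.0, 4.0]      # X-1, X-2

prediction = predict(thetas, features)
print(prediction)          # 1.0 + 2.0 * 3.0 + 0.5 * 4.0
```

Each attribute gets multiplied by its own slope, and the intercept is added once at the end.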

Cost Function

Related article: "Machine Leaning: Cost Function and Gradient Descend" (towardsdatascience.com)

The cost function measures how much difference there is between your predicted hypothesis h(x) and the actual points.
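As a rough sketch of what such a cost function can look like, the snippet below assumes the common mean squared error form, with the total divided by 2m. The text does not say which cost function is meant, so treat that choice, the function name cost, and the toy points as my assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """Mean squared error (divided by 2m) of the line theta0 + theta1 * x."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y   # predicted h(x) minus actual point
        total += error ** 2
    return total / (2 * m)


# Toy points that lie exactly on y = 2x (hypothetical data).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect = cost(0.0, 2.0, xs, ys)   # line passes through every point
poor = cost(0.0, 1.0, xs, ys)      # slope too small, so errors pile up
print(perfect, poor)
```

The closer the line is to the points, the smaller the cost, which is why we look for the thetas that make it as small as possible.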

Since we are first considering a linear model, let’s see how it looks on a graph: one that has only two points, and one that has many points.

However, do you think a linear model, or a straight line, can represent data that looks something like this?

There could be so many ways to fit a straight line to this kind of dataset.

Therefore, we start to use a polynomial equation of the form shown below:

h(x) = theta-0 + theta-1 * x + theta-2 * x^2 + ... + theta-n * x^n

It forms a curved line, which can represent the data points better than a straight line.

When we had only one slope theta, the line had only one direction and hence we got a straight line. If we have many thetas, we have many slopes, and hence our line can change direction in many different ways.
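A small sketch of that idea: the same function evaluates a straight line and a curve, and only the number of thetas differs. The name poly_predict and the coefficient values are hypothetical, picked so the arithmetic is easy to follow.

```python
def poly_predict(thetas, x):
    """Polynomial hypothesis: theta-0 + theta-1 * x + theta-2 * x^2 + ..."""
    return sum(theta * x ** power for power, theta in enumerate(thetas))


straight = [0.0, 1.0]        # one slope: a straight line
curved = [0.0, 1.0, -0.5]    # an extra theta on x^2 lets the line bend

xs = [0.0, 1.0, 2.0, 3.0]
straight_ys = [poly_predict(straight, x) for x in xs]
curved_ys = [poly_predict(curved, x) for x in xs]

print(straight_ys)   # keeps rising at the same rate
print(curved_ys)     # rises first, then turns and falls
```

With one slope the predictions keep moving in the same direction, while the extra theta makes the curve turn around.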

See the picture below.




Just as we want our cost function to be at a minimum for a straight line, we also want it to be at a minimum for a polynomial curve.

We use gradient descent to fit the best possible line by repeatedly updating all the thetas in our equation.
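Here is a minimal sketch of that update loop for a straight line, assuming a mean squared error cost. The learning rate, the number of steps, and the toy data are my own choices for illustration, not values from the article.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit theta0 + theta1 * x by repeatedly stepping against the gradient."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # slope of cost w.r.t. theta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # slope of cost w.r.t. theta1
        # Update both thetas together, using gradients from the same step.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Toy points generated from y = 2x + 1 (hypothetical data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

The same loop extends to a polynomial: you keep one gradient per theta and update all of them in every step.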

What do we need Regularization for?

Answer: to prevent the model from overfitting.


Underfitted: The hypothetical line that we draw does not really follow the same trend as the points do.

As a result, our model does not give a close picture of our data.

Solution: Make a polynomial equation that creates a curved line rather than a straight line.

Good fit: By using a polynomial equation, you add complexity to the line so that it can take different kinds of shapes, whereas if you use a single variable, say ‘X’, to predict ‘Y’, you are just creating a single straight line.

Overfitted: You know that by making your equation polynomial you can shape it to match your data points. However, if you shape the hypothetical line to the extent that it tries to pass through every single data point, then your model is overfit.

Why does that create a problem? Because when you predict something in the future, your model will not be sure where the line is going to turn, since it has not learned the general trend but rather the individual data points.
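The overfitting problem can be seen on a tiny example. The sketch below assumes NumPy is available and uses made-up data: six training points that follow y = x with a little alternating noise, and five unseen points that lie exactly on y = x. A straight line and a degree-5 polynomial are fitted with numpy.polyfit.

```python
import numpy as np

# Hypothetical training data: the trend y = x plus alternating noise.
x_train = np.arange(6, dtype=float)
noise = np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
y_train = x_train + noise

# Unseen points that follow the true trend.
x_test = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
y_test = x_test.copy()


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


line = np.polyfit(x_train, y_train, deg=1)     # simple model
wiggly = np.polyfit(x_train, y_train, deg=5)   # passes through every training point

print("train error:", mse(line, x_train, y_train), mse(wiggly, x_train, y_train))
print("test error: ", mse(line, x_test, y_test), mse(wiggly, x_test, y_test))
```

The degree-5 curve wins on the training points because it passes through all of them, but it swings between them, so it does worse than the straight line on the unseen points.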

Solution: Regularization

Regularization

Since we know that by changing the slope we can change the direction of the line, and we know that our model follows the training points too precisely, would you suggest removing the higher-degree terms from the end of the equation? No. That is the wrong approach.

What if you keep all the higher-degree terms but instead manipulate the slope associated with each term?

Remember that each term corresponds to an attribute in your dataset, for example, x1: sales, x2: profit, x3: expenditure, and so on.

How do we do that?

With the help of a method called regularization, you add a penalty to the cost function for each slope (theta), scaled by a factor called lambda. The larger a slope is, the more it adds to the cost, so minimizing the cost pushes the slope associated with each term to a lower value.

Note that we are not eliminating the higher-degree terms; we are increasing the cost of their slopes in order to penalize them.
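As a sketch of how penalizing the slopes can be written down, the function below adds lambda times the sum of squared slopes to a squared-error cost. The squared (L2, or ridge) penalty and the 1/(2m) scaling are assumptions on my part, since the text does not name a specific formula; regularized_cost and lam are hypothetical names.

```python
def regularized_cost(thetas, xs, ys, lam):
    """Squared-error cost plus an L2 penalty on every slope (not the intercept)."""
    m = len(xs)
    fit_error = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(t * x ** p for p, t in enumerate(thetas))
        fit_error += (prediction - y) ** 2
    penalty = lam * sum(t ** 2 for t in thetas[1:])   # thetas[0] is the intercept
    return (fit_error + penalty) / (2 * m)


# Toy points on y = x (hypothetical data); the line [0, 1] fits them exactly.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]

plain = regularized_cost([0.0, 1.0], xs, ys, lam=0.0)      # no penalty at all
penalized = regularized_cost([0.0, 1.0], xs, ys, lam=3.0)  # the slope now costs something
print(plain, penalized)
```

Even a perfectly fitting line pays for its slope once lambda is above zero, so the optimizer has a reason to keep every slope small unless a large one really reduces the error.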

Visualize the difference

1- If the slope is 1, then for each unit change in x there will be a unit change in y. The equation is “y=mx+c” with m=1, therefore y=x+c.

2- If the slope is 2, the equation is y=2x+c. This means that a half-unit change in x changes y by one unit. So a higher slope needs a smaller change in x.

3- If the slope is 0.5, the equation is y=0.5x+c. This means that a 2-unit change in x changes y by 1 unit. So a lower slope needs a larger change in x.

This means that, for a fixed change in y, the slope and the change in x needed to produce it are inversely proportional.

Consequently, it makes sense to shrink the slopes of the terms: with smaller slopes, the curve reacts less sharply to small changes in x, so it does not overfit the data.
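The three cases above come down to one line of arithmetic, sketched here with a hypothetical helper name: the change in x needed to move y by one unit is 1 divided by the slope.

```python
def x_change_for_unit_y(slope):
    """How far x must move for y = slope * x + c to change by one unit."""
    return 1.0 / slope


changes = {slope: x_change_for_unit_y(slope) for slope in (1.0, 2.0, 0.5)}
for slope, change in changes.items():
    print("slope", slope, "needs a change in x of", change)
```

Doubling the slope halves the change in x that is needed, and halving the slope doubles it.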

Summary

In simple words, when you introduce lambda into the cost function, the model generalizes better and gives a broader picture of the training set.

If it were not for that lambda, the model would try to fit each and every point in the training set, and hence fail during the testing phase because it would not know where to move next once new data shows up.
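To see lambda at work, here is a rough sketch that fits the same degree-5 polynomial with three values of lambda and measures how large the slopes end up. It assumes NumPy, an L2 (ridge) penalty, and a closed-form solve in place of gradient descent to keep the example short; the data and the helper name ridge_fit are made up for illustration.

```python
import numpy as np


def ridge_fit(x, y, degree, lam):
    """Least-squares polynomial fit with an L2 penalty on the slopes."""
    # Columns are x^0, x^1, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    penalty = lam * np.eye(degree + 1)
    penalty[0, 0] = 0.0   # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


# Hypothetical data: a straight trend plus alternating noise.
x = np.linspace(0.0, 1.0, 6)
y = 5.0 * x + np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5])

slope_sizes = {}
for lam in (0.0, 0.001, 1.0):
    thetas = ridge_fit(x, y, degree=5, lam=lam)
    slope_sizes[lam] = float(np.linalg.norm(thetas[1:]))   # overall size of the slopes
    print("lambda =", lam, "slope size =", round(slope_sizes[lam], 3))
```

With lambda at zero the slopes are free to grow so the curve can hit every point; as lambda increases, the slopes shrink and the curve flattens out. A lambda that is too large flattens it so much that the model underfits, so the value is usually picked by checking the error on data held out from training.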

