Machine Learning: Cost Function and Gradient Descent

Let’s see…

Cost Function

The cost function basically tells you how far away your predicted line is from the actual points that we were already given.

In other words, you had some points already given to you. Then you predicted some values of θ0 and θ1 and, using those, you drew a line on the graph. After doing that, you realize the new line doesn’t exactly touch all three data points you already had, so now you calculate how far away the original points are from your predicted line.

And that is what you calculate using the cost function.

The formula for that is as follows:

J(θ0, θ1) = 1/2m · Σ (h(x^i) − y^i)², with the sum running from i = 1 to m

Let’s break it down and see what that means.

The first term, 1/2m, is a constant, where m is the number of data points we already have; in our case it’s 3.

Then we have a summation sign, which means that for each value of the subscript ‘i’ we keep adding up the result.

The term h(x^i) means the output of our hypothesis for a particular value of i, in other words the line you are predicting using the equation h(x) = θ0 + θ1·x, and the term y^i means the value of the data point we already had.

The subscript ‘i’ is the index of the data point, so the sum covers the difference for every data point we have.
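If it helps to see it as code, here is a minimal sketch of that cost function in Python (the function and variable names are my own illustration, not from the article):

def compute_cost(theta0, theta1, xs, ys):
    # J(theta0, theta1) = (1 / 2m) * sum over i of (h(x^i) - y^i)^2
    m = len(xs)  # number of data points we already have
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        h = theta0 + theta1 * x_i   # hypothesis h(x) = theta0 + theta1 * x
        total += (h - y_i) ** 2     # squared gap between the line and the point
    return total / (2 * m)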

To get a clearer picture, let’s look at some examples.

Middle Line

The original points have values (1,1), (2,2), (3,3).

θ0 and θ1 are predicted to be 0 and 1 respectively.

Using the hypothesis equation we drew a line, and now we want to calculate the cost.

The line we drew passes through the exact same points we were already given.

So our hypothesis values h(x) are 1, 2, 3 and the values of y^i are also 1, 2, 3.

We can see that the cost function gives us zero for the middle line: 1/(2·3) · [(1−1)² + (2−2)² + (3−3)²] = 0.

The red line below is our hypothesized line and the black dots are the points we had.

Upper Line

θ0 and θ1 are predicted to be 1.5 and 1.25 respectively.

Meaning that the intercept is 1.5 on the y-axis and for each unit change in x, the hypothesis h(x) changes by 1.25 on the y-axis.

With that, we calculated our h(x) values as follows — (1.5, 2.75 and 4).

And y^i (the original data points) remains the same — (1, 2, 3).

Using the cost function, we get the following value: J = 1/(2·3) · [(1.5−1)² + (2.75−2)² + (4−3)²] ≈ 0.30.

Lower Line

The values of θ0 and θ1 for the lower line are 1.25 and 0.75 respectively.

That means it intercepts the y-axis at 1.25 and for each unit change in the value of x, the hypothesis h(x) would change at a rate of 0.75.

We already know that the values of the original points y are (1, 2 and 3) and the values of our predicted points h(x) are (1.25, 1.5 and 2).

Now, using the cost function, we can calculate the cost as shown in the figure below; it comes out to 1/(2·3) · [(1.25−1)² + (1.5−2)² + (2−3)²] ≈ 0.22.

Reducing the Cost Function

You remember the values of theta-0 and theta-1 that you predicted above; since they were just predictions, only the middle one was perfect.

But in a real life scenario, finding a perfect value of theta-0 and theta-1 is next to impossible.

However, you do have the power to manipulate the values of theta-0 and theta-1 so that, for any given set of values (x1, x2, x3…xn), you find the line that has the lowest value of the cost function.

Let’s come up with a different set of values of theta-0 and theta-1.

Say theta-0 and theta-1 are 0 and 1.42 respectively.

The line it creates will look something like the one shown below.
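To put a number on it, here’s how the earlier cost sketch would score this trial line (reusing the illustrative compute_cost function from above):

data_x = [1, 2, 3]
data_y = [1, 2, 3]

# The trial line theta0 = 0, theta1 = 1.42 misses the points,
# so its cost is clearly above zero.
print(compute_cost(0, 1.42, data_x, data_y))  # roughly 0.41
print(compute_cost(0, 1.00, data_x, data_y))  # 0.0 for the perfect line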

But, since you need to reduce your cost, you need to create a line that fits those 3 points.

Something like this:

So you use a cost-reduction method called gradient descent.

Always keep in mind that you are just adjusting the values of theta-0 and theta-1, and by doing that, you move from that red line up there to the black line below.

See below for better understanding.

We know that changing the values of theta-0 and theta-1 changes the orientation of the line, so to reach a line that fits those three points as closely as possible, we adjust all the thetas (in our case just 2 thetas) bit by bit in such a way that we reach the minimum value of the cost function.

The update rule for each theta is: θj := θj − α · ∂/∂θj J(θ0, θ1). The term alpha is the learning rate; it tells you with how much magnitude you are changing your values on each step.

Theta-j here represents each individual theta you have in your solution, so you run this equation for all the thetas, which in our case is two, but could also be three, four or ten depending upon the problem at hand.

So, the top line in the picture above had certain values of theta-0 and theta-1; then, using that formula, you adjust the values of all the thetas in your equation by some magnitude controlled by alpha and move your predicted line a bit lower.

Then you run that formula again, reduce the values of the thetas, see what the line looks like, calculate the cost and get ready for the next iteration.

Then you again reduce the values of the thetas, again look at the line and calculate the cost.

You keep doing that until your line reaches the point where the cost is at its minimum, which in our case can be seen in the picture above as the line through the values 1, 2, 3.

A perfect line with cost zero against the original data points 1, 2, 3.

The breakdown of that formula makes more sense when you expand the derivative for each theta:

θ0 := θ0 − α · (1/m) · Σ (h(x^i) − y^i)

θ1 := θ1 − α · (1/m) · Σ (h(x^i) − y^i) · x^i
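As a rough sketch, one simultaneous update step of that kind could look like this in Python (continuing the illustrative names from the cost function sketch above):

def gradient_descent_step(theta0, theta1, xs, ys, alpha):
    # Expanded update rule for both thetas, applied simultaneously:
    #   theta0 := theta0 - alpha * (1/m) * sum(h(x^i) - y^i)
    #   theta1 := theta1 - alpha * (1/m) * sum((h(x^i) - y^i) * x^i)
    m = len(xs)
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    return theta0 - alpha * grad0, theta1 - alpha * grad1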

Summary:

1- You have some data points.

2- Using them, you predict values for the thetas and draw the line using the hypothesis equation.

3- You calculate the cost using the cost function, which measures how far the line you drew is from the original data points.

4- You see that the cost function gives you some value that you would like to reduce.

5- Using gradient descent, you adjust the values of the thetas by steps of magnitude alpha.

6- With the new set of values of the thetas, you calculate the cost again.

7- You keep repeating step-5 and step-6 one after the other until you reach the minimum value of the cost function (see the sketch after this list).
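Putting it together, a minimal end-to-end run of those summary steps might look like this (alpha and the iteration count are arbitrary choices for illustration, and the function names are the assumed sketches from above):

xs = [1, 2, 3]
ys = [1, 2, 3]
theta0, theta1 = 0.0, 1.42   # the trial line from earlier
alpha = 0.1                  # step size; an assumed value

for _ in range(1000):        # repeat steps 5 and 6 until the cost settles
    theta0, theta1 = gradient_descent_step(theta0, theta1, xs, ys, alpha)

print(theta0, theta1, compute_cost(theta0, theta1, xs, ys))
# theta0 drifts toward 0, theta1 toward 1, and the cost toward 0.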
