That explains why the least squared loss works for a wide range of problems.
The underlying noise is very often Gaussian, because of the CLT (Central Limit Theorem), and minimizing the squared error turns out to be the right thing to do!We are putting in ½ so that the ‘2’ coming from the derivation of the cost function, while minimizing it for calculating gradient, cancels out and makes gradient more concise.
Anyway, it is a constant does not matter in calculating theta’s value (as we are differentiating it).
Also, we are averaging the function over m, that is the total number of data points in the training dataset.
This is so that the value of the calculated gradient does not change the scale and always averages out to a central value in case a new data point comes in.
Implementation With Gradient DescentBefore we implement gradient descent, knowing the intuition behind it is a must.
Since we are now done with that part, let’s dive into the actual implementation of it.
We want calculate the value for 0 and 1 but we can have multiple features (>=2).
In that case, the general formula to calculate consecutive step sizes will beGeneral formula for getting consecutive theta valuewhere α is the learning rate.
We can now infer from the formula that the α influences the size of the step a lot as we are multiplying the gradient with the alpha for every iteration.
Enough of mathematics, let us start with the actual code.
To start, we will initialize a random value for the parameters to be optimized, 0 and 1.
We will write two functions to calculate cost and gradient descent by iterating, and store them in two distinct numpy arrays.
The above mentioned formulae have been used in calculating the cost and the consecutive values for theta using gradient.
def cost(theta,X,y): ''' Calculates cost of the function.
X & y have their usual meaning.
theta – vector of coefficients.
''' m = len(y) # Calculating Cost c = (1/2*m) * np.
sum(np.
square((X.
dot(theta))-y)) return cdef gradient_descent(X,y,theta,alpha,iterations): ''' returns array of thetas, cost of every iteration X – X matrix with added bias.
y – target variable matrix theta – matrix of regression coefficients alpha – learning rate iteration – number of iteration to be run ''' #Getting number of observations.
m = len(y) # Initializing cost and theta's arrays with zeroes.
thetas = np.
zeros((iterations,2)) costs = np.
zeros(iterations) # Calculating theta for every iteration.
for i in range(iterations): theta = theta – (1/m)*alpha*(X.
T.
dot((X.
dot(theta))-y)) thetas[i,:] = theta.
T costs[i] = cost(theta,X,y) return theta,thetas,costs# Learning Ratealpha = 0.
01# Number of iterationsiterations = 3000# Initializing a random value to give algorithm a base value.
theta = np.
random.
randn(2,1)# Adding a biasing constant of value 1 to the features array.
X_bias = np.
c_[np.
ones((len(X),1)),X]# Running Gradient Descenttheta,thetas,costs = gradient_descent(X_bias,y,theta,alpha,iterations)# printing final values.
print('Final Theta 0 value: {:0.
3f}.Final Theta 1 value: {:0.
3f}'.
format(theta[0][0],theta[1][0]))print('Final Cost/MSE(L2 Loss) Value: {:0.
3f}'.
format(costs[-1]))Output:Final Theta 0 value: 9.
448Final Theta 1 value: 2.
015Final Cost/MSE(L2 Loss) Value: 411.
001In the code, we can see that we have run 3000 iterations.
But, that was something passed to the function explicitly.
In real life scenarios, we do not do that.
The actual stop point for a gradient descent to stop running should be when step size approaches zero.
AnalysisLet us check how the L2 Loss reduces along with increasing iterations by plotting a graph.
# Plotting Line Plot for Number of Iterations vs MSEplt.
plot(range(iterations),costs)plt.
xlabel('Number of Iterations')plt.
ylabel('Mean Squared Error')plt.
title('Cost vs Iterations Analysis')Number of Iterations vs MSE (Mean Squared Error): The Elbow MethodIn the above graph, we see that initially, the error reduces significantly.
But as iterations increase, there is not much reduction seen in the error.
It nearly stabilizes after a certain value, or we can say that it decreases negligibly.
That is the beauty of the gradient descent.
When it starts approaching local minimum, it starts to take very very small steps towards it in order to not overstep it.
On the other hand, when it is far away, to reduce computation time it takes bigger steps to reach the local minimum faster.
So, to select the best possible value of the number of iterations using this graph, we use the elbow method.
Simply, the point at which the value of MSE starts to decrease negligibly is our value for the number of iterations (the elbow of the graph).
There are various types of Gradient Descent as well.
What we did above is known as Batch Gradient Descent.
The other types are:Stochastic Gradient Descent.
Mini Batch Gradient Descent.
ConclusionGradient Descent can be used to optimize parameters for every algorithm whose loss function can be formulated and has at least one minimum.
Also, we cleared up our understanding of the infamous gradient descent loss function equation.
We learnt how to use a visual method to estimate the number of iterations required.
Finally, we got to know the actual workings behind the scenes of the famous gradient descent algorithm.
.. More details