And how do we find it?Cost functionIn the univariate case, each predicted value of y (denoted as y with a little hat), or the “hypothesis” for y (h(x)), is given by the equation:The objective is to find b₁ and b₀ such that the cost function, in this case the mean squared error (MSE), is minimized:Below is a function to find the cost of given values for b₁ and b₀ in Python:def cost(b0, b1): Yhat = b0+(b1*X) SE = (Yhat-Y)**2 N = len(Y) cost = (1/(2*N))*(sum(SE)) return(cost)Inputting different values for b₁ and plotting the returned cost function value shows the convex nature of the linear regression cost function.

beta1s = np.

arange(0,10.

5,0.

5)costs = []for b1 in beta1s: costs.

append(cost(0,b1))ax = sns.

lineplot(x=beta1s, y=costs)Cost function for various values of b₁In this case, there is a local minimum at some b₁ value between 3 and 4.

Of course, I have fixed b₀ at zero, so we cannot know if this is the optimal combination of these weights.

While we cannot solve this problem geometrically, there are luckily mathematical methods to optimize these two parameters (and hence minimize the cost function), discussed below.

Method 1: Least SquaresThe least squares method minimizes the distance between each point (xᵢ, yᵢ) and the predicted values of y (which fall on the line).

We can calculate b₁ (the weight of x₁, in this case the slope of the line since we are in 2-dimensions), with the following formula:The intercept of the line, or b₀, is calculated by taking the mean of all xᵢ, multiplying by b₁, and subtracting the product from the mean of all yᵢ:The parameters b₁ and b₀ can be calculated in Python using the following codes:b1 = sum((X-X.

mean())*(Y-Y.

mean()))/sum((X-X.

mean())**2)b0 = Y.

mean()-(b1*X.

mean())Which returns the values b₁ = 9.

102 and b₀ = -34.

671.

Plugging these values in the cost function defined above returns C(-34.

671, 9.

102) = 21.

801.

Method 2: Gradient DescentWhile the Least Squares method is fairly straightforward in the univariate case (i.

e.

with one feature, x), with multiple regression (i.

e.

multiple features) it becomes much more computationally complex.

Enter gradient descent and partial derivatives!Gradient descent begins by setting some initial value for b₁ and b₀ (say, 1 and 1) and subtracting the partial derivative with respect to b₁ and b₀ (multiplied by some learning rate, alpha).

This process is repeated until it converges at some optimal b₁ and b₀ where the slope of the cost function is zero (i.

e.

the global minimum).

In more human terms, picture standing on the top of a hill with the objective of finding the lowest point in the valley (marked with a black X).

If you leap down the hill in long strides, you might overshoot the point you are aiming for.

If you take baby steps, it will take you a long time to reach the bottom.

Step size is equivalent to alpha, or the learning rate, and there is a tradeoff between overshooting the minimum and computation time.

Mathematically, gradient descent will repeat the following until convergence:For linear regression, given the cost function:Partial derivatives, with respect to b₀ and b₁, are:def gradient_descent(alpha, b0, b1, epochs): for _ in range(epochs): Yhat = b0+(b1*X) partial_0 = (1/N)*sum(Yhat-Y) partial_1 = (1/N)*sum((Yhat-Y)*X) temp0 = b0-alpha*partial_0 temp1 = b1-alpha*partial_1 b0 = temp0 b1 = temp1 return(b0,b1)In this case, a learning rate (alpha) of 0.

01, repeated 40000 times, gets us to b₁ = 9.

056 and b₀ = -34.

377, with a cost function value of 21.

801 (just as above).

The following code overlays the regression line with the gradient descent values for b₁ and b₀ onto the scatterplot of number of rooms and house price:Yhat = b1*X+b0grid = sns.

JointGrid(X, Y, space=0)grid.

fig.

set_figwidth(7)grid.

fig.

set_figheight(4)grid.

plot_joint(plt.

scatter)grid.

ax_joint.

plot(X, Yhat, 'r-', linewidth = 2)Regression line (red) calculated using gradient descentSumming upThis post discussed the basic building blocks of one of the fundamental algorithms in Data Science and Machine Learning: Linear Regression.

I discussed two methods to optimize the regression weights, or parameters, used to predict the target value, y.

While the Least Squares method is fairly straightforward with one feature (x), gradient descent is utilized more often with large feature sets due to the greater complexity of computing all weights simultaneously.

The purpose of this post is to give some insight into what is going on “under the hood” with bread-and-butter algorithms, such as Linear Regression, to ensure proper usage and interpretation.

I hope you found this post useful and as usual feel free to reach out in the comments!.And stay tuned for further posts in this series as I continue to dive into the Mathematics of our most foundational algorithms.

Sources:Coursera | Online Courses From Top Universities.

Join for Free1000+ courses from schools like Stanford and Yale – no application required.

Build career skills in data science…www.

coursera.

orgMathematics of simple regressionhttp://regressit.

com.

The linear regression version runs on both PC's and Macs and has a richer and easier-to-use…people.

duke.

eduLinear Regression Algorithm from scratch in Python | Edureka4.

1K Views Become a Certified Professional A linear regression is one of the easiest statistical models in machine…www.

edureka.

co.