Hitting the ground with Linear Regression
Aditya Chandupatla
Jan 12

Statistically speaking, regression is a technique used to determine the statistical relationship between two or more variables, where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables.

So basically, if you strip out all the technical jargon, regression is yet another name for curve-fitting.

We are given some input data, and our task is to figure out (a.k.a. learn) a mathematical function which, when given unseen data, predicts an output value that is as close to the real output value as possible.

By and large, this definition holds good for most, if not all, of the supervised machine learning techniques.

To solidify this concept, let us take an example.

We will be using the Airfoil Self-Noise Dataset from the UCI ML repository.

The dataset is as follows:

[Figure: Sample data points from the UCI Airfoil Self-Noise dataset]

Important Note (Clarification): The input values for our problem are:

Hertz: 1st input feature, denoted by x1
Angle: 2nd input feature, denoted by x2
Chord length: 3rd input feature, denoted by x3
Velocity: 4th input feature, denoted by x4
Thickness: 5th input feature, denoted by x5

The output value, which we are going to predict, is Sound Pressure, denoted by y.

By the definition of regression stated previously, we need to learn a function which is capable of calculating ‘Sound Pressure’ given Hertz, Angle, Chord length, Velocity and Thickness.

Mathematically speaking, we need a function such as:

h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + θ5x5

where h(x) is the approximation for the actual output value ‘y’, and θ0, θ1, …, θ5 are the weights/parameters which we would like to determine by performing some magic!

Now, before diving into this problem, let us take a step back and visualise how we would solve the following, rather simpler, equation:

y = m * x + c

Let us suppose we have the following data at our disposal:

[Figure: Variation of height and weight of individuals]

In the above picture, height is our x-variable (the input feature) and weight is our y-variable (the output feature).

Remember, we need to do better than simply guessing the ‘m’ and ‘c’ values. Otherwise, what’s the point of mathematics?

The Cost Function

In order to determine ‘m’ and ‘c’, let us simply set them to 0.

Now, the equation is:

y = 0

We, as humans, know that it is an absolute blunder to set ‘m’ and ‘c’ to zero for the above dataset.

But how do we program the computer so that it can evaluate whether a particular choice of parameters is poor or not?

This is where the cost function comes into the picture.

A cost function is a function which tells the computer how good or bad a particular choice of parameters is: the lower its value, the better the choice.

Sometimes, we will refer to the computer as a model — which will be more apt in the field of machine learning.

There are several cost functions available, but the one we are interested in, and also the one most widely used in regression tasks, is the Mean Squared Error (MSE) cost function, denoted by J(θ).

[Figure: Mean Squared Error (MSE) cost function. Here ‘m’ is the total number of examples (not the slope ‘m’ of our line), and x(i), y(i) denote the i-th input example]

We use this cost function to measure how our model performs for a chosen set of parameters.

As described previously, if we were to set ‘m’ and ‘c’ to zero, we would get a very high value of J(θ), indicating that this selection is a poor choice.
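To make this concrete, here is a minimal sketch of the MSE cost for the line y = m * x + c. The heights and weights are made-up toy values (not the data from the figure), the function names are my own, and I am assuming the plain mean over all examples; some texts divide by twice the number of examples instead, which does not change which parameters come out best.

```python
import numpy as np

def predict(x, m, c):
    """Straight-line model: y = m * x + c."""
    return m * x + c

def mse_cost(x, y, m, c):
    """Mean of the squared differences between predictions and actual values."""
    errors = predict(x, m, c) - y
    return np.mean(errors ** 2)

# Made-up heights (cm) and weights (kg), for illustration only.
heights = np.array([150.0, 160.0, 170.0, 180.0])
weights = np.array([50.0, 58.0, 66.0, 74.0])

# Cost of the "blunder" choice m = 0, c = 0, i.e. the line y = 0.
cost_zero = mse_cost(heights, weights, m=0.0, c=0.0)

# Cost of a line that passes through the toy points.
cost_better = mse_cost(heights, weights, m=0.8, c=-70.0)

print("cost with m=0, c=0:     ", cost_zero)
print("cost with m=0.8, c=-70: ", cost_better)
```

The first cost is large and the second is essentially zero, which is exactly the signal the model needs in order to tell a poor choice from a good one.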

Intuition behind MSE cost function

In MSE, the core operation being performed is the difference between the predicted output value, h(x), and the actual output value, y.

This difference actually makes sense because if one wants to know how different one value is with respect to some other value, the most straightforward way to quantify such a difference is to subtract them.

Furthermore, we square the difference because the predicted value might be greater than the actual value, or vice versa.

It is crucial to remember that we would like to know the magnitude of difference, and not the direction of difference.

The square function accomplishes this.

However, there are other ways to get the magnitude of the difference, such as the modulus function (a.k.a. absolute value).

As a side note, this difference is traditionally also called the error.

It is not sufficient to simply calculate how different our model’s output value is from the actual output value for one example alone.

The superscript ‘i’ denotes ‘i’-th input example, and we perform a summation over all ‘m’ input examples and divide by the total number of examples to calculate the final difference between our model’s predictions and the actual output values.
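A tiny sketch of why the direction of the error has to be removed before averaging. The two predictions below are made-up numbers: one overshoots and one undershoots by the same amount.

```python
import numpy as np

predicted = np.array([12.0, 8.0])
actual = np.array([10.0, 10.0])

errors = predicted - actual              # one positive error, one negative error

mean_signed = np.mean(errors)            # opposite signs cancel each other out
mean_squared = np.mean(errors ** 2)      # squaring keeps only the magnitude (MSE)
mean_absolute = np.mean(np.abs(errors))  # the modulus alternative

print(mean_signed, mean_squared, mean_absolute)
```

The plain average of the errors suggests a perfect model, even though both predictions are off. The squared and absolute versions both report a non-zero error.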

Gradient Descent

OK, so finally we are at a point where we can discuss how the learning in machine learning takes place.

To restate for the sake of clarity, the simplified equation which we are trying to solve is:

y = m * x + c

We figured out, with the help of a cost function, how to determine whether a particular choice of parameters (‘m’ and ‘c’ here) is good or bad.

If the cost function outputs zero for our chosen ‘m’ and ‘c’ values, then Eureka! We have figured out the optimal parameters for our model.

But in reality, this is very unlikely to happen.

Once our cost function outputs a positive value, we need to figure out a way to modify our parameters — read as tweaking the parameters — so that the new parameters yield a lower cost function value.

If, god forbid, our cost function outputs a cost which is higher than the old value, then, clearly, something is wrong with the way we have updated our parameters.

What I have described in the previous paragraph is an extremely simplified account of gradient descent, the algorithm used for learning our model parameter values, ‘m’ and ‘c’.

Mathematically, gradient descent performs the minimisation of the cost function J(θ).

Let us see what the plot of J(θ) with respect to the parameter values θ0 (or ‘c’ in our discussion) and θ1 (or ‘m’ in our discussion) actually looks like:

[Figure: The x and y axes correspond to θ0 and θ1 respectively, whereas the z axis corresponds to the cost, J(θ)]

Therefore, when we chose ‘m’ and ‘c’ to be zero, we simply calculated the value of J(θ) at one particular point on this surface.

Hence, our task now boils down to a standard optimisation problem in mathematics.

Formally, we can describe gradient descent as follows (note: in the figure below, ‘m’ is θ2 and ‘c’ is θ1):

To the uninitiated, these mathematical terms might look daunting at first sight.

You might ask: are these the equations responsible for learning in machine learning? Yes, they are!

There are two subtle things in the above equations which are responsible for learning. (While reading the two points below, keep visualising the 3D plot of J(θ) shown above.)

1. In which direction to descend? The slope of a curve (or a line, for that matter) is positive where the curve is increasing and negative where it is decreasing.

Furthermore, for a bowl-shaped curve such as our cost function, if the curve is increasing at a particular point then the minimum lies to the left of that point, and vice versa.

Combining these two facts we get: “When the slope of the curve is positive, move towards the left, and when the slope is negative, move towards the right, because that’s where the minimum lies.”

2. How much to descend in one step? This is governed by our learning rate α.

Intuitively, to make the algorithm run faster we might want to choose as high a value as possible.

While this logic seems to be right, there is a risk of overshooting the minimum and thereby making the algorithm slower to converge or, worse yet, making it diverge.

Hence the size of the step taken in a single iteration is set by the learning rate α: each parameter moves by α times the slope at the current point.
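The two points above (direction from the sign of the slope, step size from the learning rate) can be sketched as a single update for ‘m’ and ‘c’. This is a minimal illustration on made-up data, assuming the plain-mean form of the MSE cost. The function names and the value alpha=0.05 are my own choices, not taken from the notebook.

```python
import numpy as np

def cost(x, y, m, c):
    """MSE cost of the line y = m * x + c."""
    return np.mean(((m * x + c) - y) ** 2)

def gradients(x, y, m, c):
    """Slopes of the MSE cost with respect to m and c."""
    errors = (m * x + c) - y
    grad_m = 2.0 * np.mean(errors * x)
    grad_c = 2.0 * np.mean(errors)
    return grad_m, grad_c

def gradient_step(x, y, m, c, alpha):
    """Move each parameter against its slope, scaled by the learning rate."""
    grad_m, grad_c = gradients(x, y, m, c)
    return m - alpha * grad_m, c - alpha * grad_c

# Toy data generated from the line y = 2x + 1, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

m, c = 0.0, 0.0
cost_before = cost(x, y, m, c)
grad_m, grad_c = gradients(x, y, m, c)   # both slopes are negative here
m, c = gradient_step(x, y, m, c, alpha=0.05)
cost_after = cost(x, y, m, c)

print("slopes:", grad_m, grad_c)
print("new parameters:", m, c)
print("cost before and after:", cost_before, cost_after)
```

Both slopes are negative at the starting point, so subtracting alpha times the slope moves ‘m’ and ‘c’ to the right (they increase), and the cost goes down.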

Note: The learning rate α is a hyper-parameter.

A hyper-parameter is a parameter whose value is set before the learning process begins.

By contrast, the values of other parameters are derived via training.

Moreover, there is no pre-defined formula for choosing the right set of hyper-parameters for a given model.

In practice, hyper-parameters are set by trial and error.

However, occasionally, one may come across hyper-parameter best practices and start the learning process from those values.

This is why the process of finding hyper-parameters is sometimes called hyper-parameter tuning.
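Here is a rough sketch of what such trial and error can look like for the learning rate. The candidate values, the toy data and the fixed budget of 200 iterations are all assumptions made for illustration. In a real workflow you would usually compare the candidates on a held-out dev set rather than on the training cost.

```python
import numpy as np

def run_gradient_descent(x, y, alpha, n_iters=200):
    """Run a fixed number of gradient descent steps and return the final MSE cost."""
    m, c = 0.0, 0.0
    for _ in range(n_iters):
        errors = (m * x + c) - y
        m -= alpha * 2.0 * np.mean(errors * x)
        c -= alpha * 2.0 * np.mean(errors)
    return np.mean(((m * x + c) - y) ** 2)

# Toy data generated from the line y = 2x + 1, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Hypothetical candidates: one too small, one reasonable, one too large.
candidate_alphas = [0.001, 0.05, 0.3]
final_costs = {alpha: run_gradient_descent(x, y, alpha) for alpha in candidate_alphas}
best_alpha = min(final_costs, key=final_costs.get)

for alpha, final_cost in final_costs.items():
    print(f"alpha={alpha}: final cost = {final_cost:.3e}")
print("best alpha:", best_alpha)
```

On this toy problem the smallest rate is still far from the minimum after 200 steps, the largest one overshoots and blows up, and the middle one gets close to zero cost.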

Optional Information: If you are curious to know how gradient descent works (or descends) when the learning rate, alpha, is varied:

[Figure: The curve is a 2D representation of the above cost function J(θ); the direction and magnitude of the arrows represent the way the learning rate varies]

Linear Regression Model

We have talked about the cost function and gradient descent.

Let’s put them together in our linear regression model:

1. Grab the input dataset.
2. Figure out what the features are, and what the output variable is that we are going to predict.
3. Initialise the weights of the features to zero.
4. Repeat the following two steps until convergence:
(a) Calculate the cost of our model for the selected parameters
(b) Update the parameters with gradient descent
5. After convergence, you get the final parameters for your model, which hopefully will give you the correct predictions for unseen data.

Convergence: What do we mean by convergence? Simply put, it is the point at which there is no benefit in making further parameter updates using gradient descent, because the cost function has remained practically constant for quite some time.

This statement might seem like a soft constraint instead of a hard constraint, and I cannot agree more.

Certain aspects of Machine Learning are still more of an art than a science, often filled with uncertainty.

In fact, this is what makes the field all the more interesting.

Practically, one can plot the value of the cost function at every iteration of the above loop, and if the implementation is correct, you will see a curve which is more or less similar to this:

[Figure: Several curves showing the variation of the cost function with increasing number of iterations. Each curve represents a random initialisation of the parameter values]

Evidently, we can see that after 4000 iterations, the cost function stagnates.

This is how we determine the convergence.
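Putting the whole recipe together (zero initialisation, cost, update, and a stopping rule based on the cost no longer changing), a minimal NumPy sketch might look like the following. It is not the code from the notebook: the function names, the tolerance `tol`, the learning rate and the synthetic two-feature data are all assumptions made for illustration.

```python
import numpy as np

def cost(Xb, y, theta):
    """MSE cost for parameters theta (Xb already contains a column of ones)."""
    return np.mean((Xb @ theta - y) ** 2)

def fit_linear_regression(X, y, alpha=0.1, tol=1e-10, max_iters=10_000):
    """Gradient descent that stops once the cost changes by less than tol."""
    n_examples = X.shape[0]
    Xb = np.hstack([np.ones((n_examples, 1)), X])  # column of ones for theta0
    theta = np.zeros(Xb.shape[1])                  # initialise the weights to zero
    history = [cost(Xb, y, theta)]
    for _ in range(max_iters):
        gradient = (2.0 / n_examples) * Xb.T @ (Xb @ theta - y)
        theta = theta - alpha * gradient           # gradient descent update
        history.append(cost(Xb, y, theta))
        if abs(history[-2] - history[-1]) < tol:   # cost has stagnated
            break
    return theta, history

# Synthetic data: y = 1 + 2*x1 - 3*x2, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([1.0, 2.0, -3.0])
y = true_theta[0] + X @ true_theta[1:]

theta, history = fit_linear_regression(X, y)
print("learned parameters:", theta)
print("iterations run:", len(history) - 1)
```

Plotting `history` against the iteration number gives the kind of flattening curve described above.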

Now that we have understood what linear regression is, let us see it in practice by applying it to the Airfoil Self-Noise Dataset mentioned at the beginning of this article.

We will be using Python to implement the model.

h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + θ5x5

From here on, you can follow along with me by referring to the IPython notebook present in my Github repository.

Here is the crux of the code:

Code Flow:

Read and pre-process the dataset: The dataset is present in the repository in the form of a CSV file.

We can read it via Pandas and visualise it using the matplotlib library.

For more information, refer to the Github repository link where you will find additional code for all the utility functions.

For illustration, here is the visualisation (scatter plot) of thickness (one of the input features) and sound_pressure (the output variable).

The primary advantage of visualisation is that we can determine whether there are any redundant features by looking for correlation.

Moreover, we can discard a few data points which seem like outliers, or check for class imbalance if we are performing classification.

Other visualisations:

Normalize the features: This ensures that no feature carries more weight than another merely because of its scale.

For now, you can assume that this is a required step in every machine learning workflow.

I will be writing another in-depth article on this later.

Until then you can refer here.
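As a quick illustration of the normalisation step, here is one common way to do it: subtract each feature's mean and divide by its standard deviation. I am assuming this particular scheme for the sketch; the notebook may well use a different one. The four rows below are made-up values on deliberately different scales.

```python
import numpy as np

def normalize(X, mean=None, std=None):
    """Scale every column to zero mean and unit standard deviation."""
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0)
    return (X - mean) / std, mean, std

# Made-up rows: a frequency-like column and a thickness-like column.
X = np.array([
    [800.0, 0.003],
    [1000.0, 0.005],
    [2000.0, 0.010],
    [6300.0, 0.006],
])

X_norm, feature_mean, feature_std = normalize(X)
print(X_norm)

# Reuse the same statistics for any later data, e.g. a new example.
new_example = np.array([[1500.0, 0.004]])
new_norm, _, _ = normalize(new_example, feature_mean, feature_std)
print(new_norm)
```

Returning the mean and standard deviation lets you apply exactly the same scaling to data the model has not seen during training.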

Split the dataset into training, dev, and test sets: Usually we do not use the entire dataset to perform our learning.

A common choice is to dedicate 80% of the data to training and to split the remaining 20% equally between the dev and test sets.

A wonderful intuition is given here.
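A minimal sketch of such an 80/10/10 split using NumPy. The function name, the fixed seed and the dummy arrays are my own choices for illustration. Shuffling before splitting is an assumption that suits data whose rows have no meaningful order.

```python
import numpy as np

def train_dev_test_split(X, y, train_frac=0.8, seed=0):
    """Shuffle the rows, then split into train, dev and test sets."""
    n_examples = X.shape[0]
    indices = np.random.default_rng(seed).permutation(n_examples)
    n_train = int(train_frac * n_examples)
    n_dev = (n_examples - n_train) // 2   # half of the remainder goes to dev
    train_idx = indices[:n_train]
    dev_idx = indices[n_train:n_train + n_dev]
    test_idx = indices[n_train + n_dev:]
    return (
        (X[train_idx], y[train_idx]),
        (X[dev_idx], y[dev_idx]),
        (X[test_idx], y[test_idx]),
    )

# Dummy data with 100 examples and 2 features, for illustration only.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float)

(train_X, train_y), (dev_X, dev_y), (test_X, test_y) = train_dev_test_split(X, y)
print(len(train_y), len(dev_y), len(test_y))
```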

Gradient descent: Run gradient descent to compute the optimal value of the parameters.

Note: Once again, if you were to look at the code, you will encounter another variable ‘lambd’ representing the hyper-parameter lambda.

Lambda is the regularization hyper-parameter, used to ensure that our model does not overfit the training set.

More on this in another article.
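Until then, here is a small sketch of how a `lambd` term can enter the cost and its gradient. I am assuming L2 regularization, i.e. a penalty on the squared weights with the bias θ0 left out, and a particular scaling of the penalty. The notebook may define it differently, so treat the formulas below as one common convention rather than the exact implementation.

```python
import numpy as np

def regularized_cost(Xb, y, theta, lambd):
    """MSE cost plus an L2 penalty on all weights except the bias theta[0]."""
    n_examples = Xb.shape[0]
    mse = np.mean((Xb @ theta - y) ** 2)
    penalty = (lambd / n_examples) * np.sum(theta[1:] ** 2)
    return mse + penalty

def regularized_gradient(Xb, y, theta, lambd):
    """Gradient of regularized_cost with respect to theta."""
    n_examples = Xb.shape[0]
    grad = (2.0 / n_examples) * Xb.T @ (Xb @ theta - y)
    grad[1:] += (2.0 * lambd / n_examples) * theta[1:]
    return grad

# Tiny made-up example; the first column of ones is for the bias.
Xb = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 3.0])
theta = np.array([0.5, 1.5])

plain = regularized_cost(Xb, y, theta, lambd=0.0)
penalised = regularized_cost(Xb, y, theta, lambd=1.0)
print(plain, penalised)
print(regularized_gradient(Xb, y, theta, lambd=1.0))
```

With `lambd` set to zero this reduces to the plain MSE cost. Larger values push the weights towards zero, which is what discourages overfitting.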

Results: Finally, we calculate the RMSE (Root Mean Squared Error) of our model on train, dev and test sets to report the results.

The intuition behind RMSE is the same as that of the MSE cost function.

RMSE is one of the metrics which is used to determine how well a model is performing.

An RMSE value of 0 means that the predicted result is exactly the same as the actual result, whereas a higher (non-zero) RMSE value means that the predicted result is different from that of the actual output value.
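As a small sketch, RMSE is just the square root of the mean squared error, which brings the number back to the same units as the output variable. The function name is my own. The single example uses the predicted and actual sound pressure values quoted in this article.

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Squared Error between predictions and actual values."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Identical predictions give an RMSE of zero.
perfect = rmse([112.54], [112.54])

# One prediction of 118.26 against an actual value of 112.54.
single_point = rmse([118.26], [112.54])

print(perfect, single_point)
```

For a single example, RMSE is simply the absolute difference between the prediction and the actual value.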

For our model built for the Airfoil Self-Noise dataset, the results are:

From the last example in the above image, we can see that for the datapoint:

6300 Hertz
5.3 Angle
0.2286 Chord length
39.6 Velocity
0.006 Thickness

we predicted 118.26 as the Sound Pressure, the actual value being 112.54.

Pretty close! :D

Although we have achieved an RMSE of 4.87 on the test set, we can do much better.

In this article we have tackled a supervised learning problem using linear regression, the easiest but not necessarily the most effective technique for the problem at hand.

Moreover, the model which we have designed is a linear model:

h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + θ5x5

In reality, many relationships are non-linear.

Consider the problem of house price prediction with the area of the house for example.

If the area of the house keeps on increasing, the price need not be increasing proportionately.

If you are already familiar with Machine Learning, you might be wondering what’s the point of implementing linear regression from scratch using NumPy.

There are already several high-level libraries which do it for you.

For example, in Tensorflow:

    # Estimator using the default optimizer.
    estimator = LinearRegressor(feature_columns=[categorical_column_a, categorical_feature_a_x_categorical_feature_b])

However, I adhere to the idea that if one wants to master a concept, in the computer science domain or otherwise, then one has to learn it from first principles.

Coming up next, logistic regression ….