Statistical Overview of Linear Regression (Examples in Python)

Jovan Medford, Mar 23

In statistics we are often looking for ways to quantify relationships between factors and responses in real life.

That being said, we can largely divide the responses we want to understand into two types: categorical responses and continuous responses.

For the categorical case, we are looking to see how certain factors influence the determination of which type of response we have out of a set of options.

For example, consider a data set concerning brain tumors.

In this case, the factors would involve the size of the tumor and the age of the patient.

On the other hand, the response variable would have to be whether the tumor is benign or malignant.

These types of problems are usually called classification problems and can indeed be handled by a special type of regression called logistic regression.

For the continuous case however, we are looking to see how much our factors influence a measurable change in our response variable.

In our particular example we will be looking at the widely used Boston Housing data set, which could be found in the scikit-learn library.

While this data set is commonly used for predictive multiple regression models, we are going to focus on a statistical treatment to understand how features like an increase in the number of rooms can affect housing value.

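Model Notation and Assumptions

The linear regression model is represented as follows:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$$

where $y_i$ is the response for row $i$, $x_{i1}, \dots, x_{ip}$ are its factor values, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the factor coefficients, and $\epsilon_i$ is the random error.

In order to “fit” a linear regression model, we make certain assumptions about the random error. When these assumptions hold, the fitted model is a good approximation of the actual phenomenon.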

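These assumptions are as follows: the random errors are independent, normally distributed, and have zero mean and constant variance.

Exploratory Work

For the purpose of this model we will be looking at MEDV as the response variable.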

MEDV is the median value of owner-occupied homes in $1000's.

Here we are going to take a look at the row of the pairplot that has MEDV on the y-axis.

That is, we are going to look at scatter plots of how the housing value relates to all the other features in the dataset.

You can find a full description of all the names of the dataset here: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

[Figure: All Features vs the Median Value of Owner Occupied Homes]

One particularly linear relationship that stands out is that of the average number of rooms per dwelling (RM).

Here’s a closer look:

[Figure: scatter plot of MEDV against RM]

Simple Linear Regression

The simplest form of linear regression consists only of the intercept and a single factor term.

In this case we will fit the model using the least squares method.

Least Squares Method

When we say ‘fit’ the model, we mean that we find estimates of the factor coefficients that best suit the data given.

In the regression case, that means we find the line that minimizes the total squared distance between itself and the observed data points.

That is, we minimize the sum of squared errors, represented mathematically as:

$$\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$

where sigma ($\sum$) is the sum over all the rows in the data set, $y_i$ is the observed response and $\hat{y}_i$ is the value predicted by the regression line.
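As a minimal sketch of what minimizing the squared errors means, the snippet below computes the closed-form least squares estimates for a single factor and checks that nudging the line away from them increases the sum of squared errors. The data are synthetic and the names (`x`, `y`, `sum_squared_errors`) are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston data).
rng = np.random.default_rng(0)
x = rng.uniform(4, 9, size=200)                    # a single factor
y = -30.0 + 9.0 * x + rng.normal(0, 2, size=200)   # linear signal plus random error


def sum_squared_errors(b0, b1):
    """Sum over all rows of the squared gap between y and the line b0 + b1*x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)


# Closed-form least squares estimates for one factor plus an intercept.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

sse_best = sum_squared_errors(b0_hat, b1_hat)
sse_shifted = sum_squared_errors(b0_hat + 1.0, b1_hat)   # move the line up
sse_tilted = sum_squared_errors(b0_hat, b1_hat + 0.1)    # change the slope

print("intercept:", b0_hat, "slope:", b1_hat)
print("SSE at estimates:", sse_best)
print("SSE shifted:", sse_shifted, "SSE tilted:", sse_tilted)
```

Both perturbed lines give a larger sum of squared errors than the fitted one, which is exactly the property the least squares estimates are chosen for.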

Here’s how we fit the model using the Python library statsmodels.

We first import the library:

    import statsmodels.api as sm

We are now ready to fit. Notice how we have to add in a column of ones called the ‘intercept’.

This is due to the fact that we can rewrite our model using linear algebra so that:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

So the ones column is the first column in our X matrix, so that when we multiply by the factor coefficients vector we get our intercept value in each equation.

This form is what is used to extend to the multiple regression case, but we won’t be extensively covering that math in this article.

Fit Summary

    slr_results.summary()

coef: These are the estimates of the factor coefficients.

Oftentimes it would not make sense to consider the interpretation of the intercept term.

For instance, in our case, the intercept term has to do with the case where the house has 0 rooms…it doesn’t make sense for a house to have no rooms.

On the other hand, the RM coefficient has a lot of meaning.

It suggests that for each additional room, you could expect an increase of $9,102 in the median value of owner-occupied homes.

P>|t|: This is a two-tailed hypothesis test where the null hypothesis is that RM has no effect on MEDV.

Since the p-value is so low that it is approximately zero, there is strong statistical evidence to reject the claim that RM has no effect on MEDV.

R-squared: This is the proportion of the variance in the response that is explained by the model, and it is often considered to be a measure of how well the model fits.

However, this measure can be artificially inflated by increasing the number of factors, even if they are not significant.

For this reason we must also consider the adjusted R-squared, which penalizes the R-squared for the number of factors.

However, in simple linear regression these two are nearly the same.
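To make the summary quantities more concrete, here is a rough sketch that recomputes the coefficient estimates, the two-tailed p-values, R-squared and adjusted R-squared by hand with NumPy and SciPy. Everything in it is an assumption for illustration: the data are synthetic, and the helpers `fit_ols` and `adjusted_r2` are hypothetical names. It also adds five pure-noise columns to show how R-squared creeps upward with extra factors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(4, 9, size=n)
y = -30.0 + 9.0 * x + rng.normal(0, 4, size=n)   # synthetic response


def fit_ols(X, y):
    """Return coefficients, residuals and R-squared for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return beta, resid, r2


def adjusted_r2(r2, n_obs, n_factors):
    """Penalize R-squared for the number of factors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_factors - 1)


X = np.column_stack([np.ones(n), x])
beta, resid, r2 = fit_ols(X, y)

# Standard errors from the estimated error variance and (X'X)^-1.
dof = n - X.shape[1]
sigma2 = np.sum(resid ** 2) / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stat = beta / se
p_value = 2 * stats.t.sf(np.abs(t_stat), dof)    # two-tailed test

# Refit after adding five columns of pure noise.
X_big = np.column_stack([X, rng.normal(size=(n, 5))])
_, _, r2_big = fit_ols(X_big, y)

print("coef:", beta, "P>|t|:", p_value)
print("R-squared:", r2, "adjusted:", adjusted_r2(r2, n, 1))
print("with noise, R-squared:", r2_big, "adjusted:", adjusted_r2(r2_big, n, 6))
```

R-squared can never decrease when columns are added to the same model, which is why the adjusted version, with its penalty for the number of factors, is the fairer basis for comparison.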

[Figure: How our line fits the data]

Multiple Linear Regression

What happens when we add in other factors?

Seeing as we don’t have much else to do, why not just throw in all of ‘em? (Disclaimer: maybe you shouldn’t try that at home… especially if you have thousands of features…)

(Except CHAS, since it is binary.)

Okay… I apologize for how long that summary table was :/

Note that the R-squared and adjusted R-squared increased dramatically.

Also notice, however, that there is now a feature whose p-value is too high for us to reject the claim that it does NOT affect MEDV.

That feature is AGE, and we will be removing it from our model and refitting.

We will continue to trim and compare until all the remaining features are statistically significant.

[Figure legend: Red: Dropped, Black: R-squared, Green: Adjusted R-squared]

As we can see, the columns AGE and INDUS were not contributing to the fit of the model and are better left out.

Prob(F-statistic): This is the p-value associated with the test of the significance of the overall model.

In this case the null hypothesis is that the model is not overall significant, that is, that none of the factors has an effect on MEDV.

Since our value is way below 0.01, we can reject the null hypothesis in favour of the alternative, that the model is statistically significant.
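For readers who want to see where the overall significance test comes from, the sketch below computes the F statistic and its p-value by hand. The data are synthetic and the variable names are illustrative assumptions; the formula compares the variation explained per factor with the residual variation per degree of freedom.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): three factors, one of which has no effect.
rng = np.random.default_rng(4)
n, p = 150, 3
factors = rng.normal(size=(n, p))
y = 2.0 + factors @ np.array([1.5, 0.0, -0.8]) + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), factors])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

ss_resid = np.sum((y - X @ beta) ** 2)        # unexplained variation
ss_total = np.sum((y - y.mean()) ** 2)        # total variation
ss_model = ss_total - ss_resid                # variation explained by the model

# F = (explained variation per factor) / (residual variation per degree of freedom)
f_stat = (ss_model / p) / (ss_resid / (n - p - 1))
prob_f = stats.f.sf(f_stat, p, n - p - 1)     # upper-tail p-value

print("F-statistic:", f_stat)
print("Prob(F-statistic):", prob_f)
```

A tiny `prob_f` means that a model with these factors explains far more variation than an intercept-only model would by chance.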

***NOTE***

For multiple linear regression, the beta coefficients have a slightly different interpretation.

For example, the RM coef suggests that for each additional room, we can expect a $3,485 increase in the median value of owner-occupied homes, all other factors remaining the same.

Also note that the actual value of the coefficient has changed as well.

Model Diagnostics

At this point you may be wondering what we could do to improve the fit of the model.

Since the adjusted R-squared is 0.730, there is certainly room for improvement.

One thing that we could do is look for new features to add to the data set that may be important to our model.

Apart from that, we would also need to test to see if it was ever actually reasonable to apply linear regression in the first place.

For that we look to residual plots.

The benefit here is that the residual is a good estimate of the random error within the model.

That being said, we can plot the residuals against the feature (or the predicted values) to get a feel for the distribution of the residuals.

If the model assumptions hold, we would expect to see a completely scattered diagram, with points falling within a band of constant width around zero.

This would be consistent with the residuals being independent, normally distributed, with zero mean and constant variance.
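A minimal sketch of computing the residuals and running a crude check on them is shown below. The data are synthetic and well behaved by construction, and splitting the residuals at the median fitted value is just one simple, assumed way to compare their spread; it is not a formal test.

```python
import numpy as np

# Synthetic data (assumption) that satisfies the model assumptions.
rng = np.random.default_rng(5)
n = 300
x = rng.uniform(4, 9, size=n)
y = -30.0 + 9.0 * x + rng.normal(0, 3, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# With an intercept in the model, least squares residuals average to zero.
print("mean residual:", residuals.mean())

# Crude constant-variance check: compare the spread for low and high fitted values.
low = residuals[fitted <= np.median(fitted)]
high = residuals[fitted > np.median(fitted)]
print("std (low fitted):", low.std(ddof=1))
print("std (high fitted):", high.std(ddof=1))

# To draw the residual plot itself, one option (assuming matplotlib is installed):
#   import matplotlib.pyplot as plt
#   plt.scatter(fitted, residuals); plt.axhline(0); plt.show()
```

If the two spreads differ substantially, or the scatter shows a curve or a funnel shape, that is a hint that the constant-variance or linearity assumption is in doubt.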

[Figure: standard residual plot]

However, in our SLR case with RM we have:

[Figure: residual plot for the SLR model with RM]

Here we can see that there is some hint of a non-linear pattern, as the residual plot seems to be curved at the bottom.

This is a violation of our model assumptions, where we assumed the model to have constant variance.

In fact, in this case it would seem that the variance is changing according to some other function.

We will be using the Box-Cox method to deal with this issue in a second.

We can further check our normality assumption by creating a Q-Q plot (qqplot).

For a normal distribution, the points of a Q-Q plot will tend toward a straight line.

Note, however, that in our case the line is clearly skewed.
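One way to put a number on how straight a Q-Q plot is, sketched here under assumptions, is the correlation coefficient returned by `scipy.stats.probplot`. The two samples below are synthetic stand-ins for residuals: one drawn from a normal distribution and one deliberately skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
normal_resid = rng.normal(0, 1, size=500)             # behaves as assumed
skewed_resid = rng.exponential(1.0, size=500) - 1.0   # right-skewed, mean near zero

# probplot returns the Q-Q points and a least squares line through them;
# r measures how close the points lie to that straight line.
(_, _), (_, _, r_normal) = stats.probplot(normal_resid, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_resid, dist="norm")

print("r for normal residuals:", r_normal)
print("r for skewed residuals:", r_skewed)
```

The closer `r` is to 1, the closer the Q-Q points lie to a straight line; the skewed sample scores visibly lower.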

Box-Cox Method

This is a process whereby we find the maximum likelihood estimate of the most appropriate transformation to apply to our response value so that our data would have a constant variance.

Implemented using scipy.stats, we got a lambda value of 0.45, which we can use as 0.5 since it won’t make a huge difference in terms of fit, but it will make our answer more interpretable.
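To illustrate how the fitted lambda relates to a transformation, here is a small sketch on synthetic data. The sample is built as the square of a normal variable (an assumption chosen so that a square-root transformation, lambda near 0.5, is the natural answer), and the manual formula `(y**lambda - 1) / lambda` is the Box-Cox transform for a non-zero lambda; a lambda of 0 corresponds to taking the logarithm.

```python
import numpy as np
from scipy import stats

# Synthetic positive data (assumption): the square of a normal variable.
rng = np.random.default_rng(7)
z = rng.normal(5.0, 1.0, size=5000)
z = z[z > 0]          # Box-Cox needs strictly positive values
y = z ** 2

# boxcox returns the transformed data and the maximum likelihood lambda.
transformed, fitted_lambda = stats.boxcox(y)

# The same transformation written out by hand (valid for lambda != 0).
manual = (y ** fitted_lambda - 1.0) / fitted_lambda

print("fitted lambda:", fitted_lambda)
print("matches manual formula:", np.allclose(transformed, manual))
print("skewness before:", stats.skew(y), "after:", stats.skew(transformed))
```

Rounding the fitted lambda to a nearby familiar value such as 0.5 trades a little fit for a transformation that is easier to explain.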

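    import scipy.stats as stats

    # boxcox returns the transformed values and the fitted lambda;
    # the input must be strictly positive
    df['RM2'], fitted_lambda = stats.boxcox(df['RM'])

Below you will find a table of common lambda values and their suggested response transformations.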

[Table: Common Lambda Values, taken from Statistics How To]

After this we transform our response accordingly:

    import numpy as np

    df['MEDV2'] = np.sqrt(df['MEDV'])

And then we fit our model to MEDV2, which is the square root of MEDV.

Both our R-squared and adjusted R-squared values went up by quite a bit with the same number of features.

Kinda cool, right?

Additionally, there has also been some improvement in our residual plots:

[Figure: residual plots after transforming the response]

Further, we could also try applying our transformation to each of the feature variables:

    X = df[['intercept', 'CRIM', 'ZN', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
    # square root of every column; the intercept column of ones is unchanged
    X2 = X.apply(np.sqrt)

We then fit our model consisting of X2 as our feature matrix and MEDV2 as our response variable.

Here we see even more improvement in these values with the same number of features.

Our final residual plot looks like this:

[Figure: final residual plot]

So there is clearly still some room for improvement, but we’ve definitely made some progress.

Some Additional Notes

There are many more parts to the statistical understanding of the regression model.

For instance, it would be possible for us to come up with confidence intervals surrounding each beta coefficient or even around our predicted values.
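As a sketch of the confidence intervals mentioned above, the snippet below builds 95% intervals around each coefficient from its standard error and the t distribution. The data are synthetic and the variable names are illustrative assumptions; the 95% level is also an assumed choice.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption) with one factor and an intercept.
rng = np.random.default_rng(8)
n = 200
x = rng.uniform(4, 9, size=n)
y = -30.0 + 9.0 * x + rng.normal(0, 3, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Standard errors of the coefficients.
dof = n - X.shape[1]
sigma2 = np.sum(resid ** 2) / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# 95% interval: estimate plus or minus the t critical value times the standard error.
t_crit = stats.t.ppf(0.975, dof)
ci_lower = beta - t_crit * se
ci_upper = beta + t_crit * se

for name, lo, hi in zip(["intercept", "slope"], ci_lower, ci_upper):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

With statsmodels, the fitted results object exposes the same intervals through its `conf_int()` method.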

Further, regarding CHAS, it would have been fine to include it due to its {1,0} encoding.

For a categorical feature with more than two options, it would be difficult or impossible to interpret the beta coefficient unless there was a natural ordering to the options.

When we fit our final model with CHAS we get an adjusted R-squared of 0.794.

The interpretation of the beta coefficient of 0.077 in this case would be: if the house tract bounds the Charles River, then you can expect an increase of 0.077 in the square root of the median value of owner-occupied homes (in $1000's) over a house that is not bounded by the river, all other factors remaining the same. Because the final model is fit to the square root of MEDV, the corresponding dollar amount depends on the baseline value of the home.
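Since the final model was fit to MEDV2, the square root of MEDV, a coefficient on that scale can be converted back to dollars only for a chosen baseline. The sketch below assumes that the 0.077 coefficient was estimated on the square-root scale; the baseline values and the helper `dollar_effect` are hypothetical choices for illustration. CHAS itself is unaffected by the square-root transformation of the features, since the square roots of 0 and 1 are 0 and 1.

```python
import math

coef_sqrt_scale = 0.077   # CHAS coefficient reported in the text


def dollar_effect(baseline_medv_thousands, coef=coef_sqrt_scale):
    """Change in MEDV, in dollars, when sqrt(MEDV) rises by `coef`."""
    new_value = (math.sqrt(baseline_medv_thousands) + coef) ** 2
    return (new_value - baseline_medv_thousands) * 1000.0


# Hypothetical baselines: median values of $16,000 and $25,000.
for baseline in (16.0, 25.0):
    print(baseline, round(dollar_effect(baseline), 2))
```

Under that assumption, the effect works out to roughly $622 at a baseline of $16,000 and roughly $776 at a baseline of $25,000, so it grows with the value of the home.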

Conclusion

I hope that I was able to get across the idea that regression can do more than just develop predictions based on certain features.

In fact, there is a whole world of regression analysis dedicated to using these techniques to gain a deeper understanding of the relationships between variables in the real world around us.
