Ordinary Least Squares (OLS) is a method that finds the β̂ coefficients minimizing the sum of squared residuals, i.e. minimizing the sum of the squared differences (y − ŷ)² over all values of y and ŷ in the training observations.

Think of y and ŷ as column vectors with entries equal to the number of your total observations.

The fascinating piece is that OLS provides the best linear unbiased estimator (BLUE) of y under a set of classical assumptions.

That’s a bit of a mouthful, but note that:

- “best” = minimal variance of the OLS estimates of the true betas (i.e. no other linear unbiased estimator has lower variance!)
- “unbiased” = the expected values of the estimated beta-hats equal the true beta values

Clearly a BLUE estimator is desirable, yet it can also be elusive to achieve, as you’ll see below.

Above I wrote y as a function of one explanatory feature and one intercept, but in practice this is going to be a multi-dimensional problem.

To make this more concrete: say you wanted to predict the Housing Price Index of the US and you have the following data:

DataFrame for Regression

Here your column housing_price_index is your labeled y vector, and the 7 columns from sp500 to gross_domestic_product are 7 feature vectors, which you can also think of as a feature matrix X.

This means we’d estimate 7 beta coefficients, one per feature, plus one intercept value.
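To make the mechanics concrete, here’s a minimal sketch of the closed-form OLS estimate, β̂ = (XᵀX)⁻¹Xᵀy, on synthetic data with 7 features (the data and coefficients are made up, not the housing dataset):

```python
import numpy as np

# Synthetic stand-in: 200 observations, 7 features plus an intercept
# (made-up data, not the actual housing dataset).
rng = np.random.default_rng(0)
n, k = 200, 7
X = rng.normal(size=(n, k))
true_beta = np.arange(1.0, k + 1)  # hypothetical true coefficients 1..7
true_intercept = 3.0
y = true_intercept + X @ true_beta + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the intercept is estimated alongside the betas.
X_design = np.column_stack([np.ones(n), X])

# Closed-form OLS via least squares (numerically safer than inverting X'X).
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat[0])   # close to 3.0 (the intercept)
print(beta_hat[1:])  # close to [1, 2, ..., 7]
```

With well-behaved data like this, the estimates land very close to the true values, which is exactly the BLUE property at work.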

These are the OLS assumptions:

1. Residual (error) distribution is iid: mean 0, constant variance.
a) The expected value of each population error term is 0 (recall we are estimating the true y)!
b) The variance of each population error term is a constant σ² (also called homoscedasticity).
c) No autocorrelation — error terms need to be independent from each other (e.g. day-to-day stock prices have autocorrelation).
2. Feature matrix X has full column rank — the number of observations is greater than the number of features, and there cannot be an exact linear relationship between any two features (no perfect multicollinearity). Without full column rank, XᵀX is not invertible and the OLS estimator cannot be computed.

3. Regression is linear in the parameters — the dependent y is a linear function of the β parameters, and all relevant explanatory variables need to be included to avoid omitted variable bias.

4. Features are independent from the error terms — also referred to as exogenous explanatory variables. This means no feature can contain explanatory information about any error term.

5. (Optional due to the CLT) Residuals (population error terms) should be normally distributed — this is often listed as optional or omitted because, for large sample sizes, the CLT says the coefficient estimates will be approximately normal (in the limit) even if the error term distribution is not.

However, for a small number of observations the CLT may not have converged anywhere near something normally distributed yet.

In my case below I had a very small sample size.
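The full-column-rank assumption (#2) is easy to check numerically. A small sketch with made-up data, using numpy’s matrix_rank and a column that is an exact linear combination of two others:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))

# Add a column that is an exact linear combination of the first two:
# this creates perfect multicollinearity.
X_bad = np.column_stack([X, X[:, 0] + 2 * X[:, 1]])

print(np.linalg.matrix_rank(X))      # 3: full column rank, OLS is fine
print(np.linalg.matrix_rank(X_bad))  # 3, but X_bad has 4 columns: rank-deficient
```

When the rank is smaller than the number of columns, XᵀX is singular and the closed-form OLS estimate breaks down.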

When things go awry

In real life your data is unlikely to perfectly meet these assumptions.

In this section I’ll show an example where my base data set blatantly violates assumptions #5 and #1b above, and what I did to fix it.

My data came from funding levels of online-crowdsourced projects, and a variety of features such as campaign length, description text sentiment (via NLTK Vader), number of online photos and many more.

By simply running OLS on the features and target (dollars pledged), here’s what the residuals looked like:

The red line indicates perfect normality, and clearly the residuals are not normally distributed, in violation of assumption #5.

This mattered in my case: I had fewer than 250 observations total, so I was not convinced my coefficient estimates would be asymptotically normally distributed.
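With a small sample, it can help to back up the visual QQ-plot check with a formal normality test. A sketch using scipy’s Shapiro-Wilk test on simulated right-skewed residuals (standing in for the actual model residuals):

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the residuals: a right-skewed sample with mean ~0,
# similar in spirit to the QQ plot described above.
rng = np.random.default_rng(2)
resid = rng.exponential(scale=1.0, size=200) - 1.0

# Shapiro-Wilk tests the null hypothesis that the sample is normal;
# a tiny p-value means we reject normality.
stat, p_value = stats.shapiro(resid)
print(p_value)
```

For skewed residuals like these the p-value comes out far below 0.05, confirming what the QQ plot shows.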

Next, here’s a plot to check whether the residuals are spread evenly across the range of predictors (assumption #1b, equal variance):

Clearly the residual errors are not spread evenly across the range of predictors, so we have issues here as well.
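A quick numerical companion to that plot is a Goldfeld-Quandt-style check: split the residuals by low vs. high fitted values and compare variances (simulated heteroscedastic residuals here, not my actual model’s):

```python
import numpy as np

# Simulated heteroscedastic residuals: the spread grows with the fitted value.
rng = np.random.default_rng(5)
fitted = np.sort(rng.uniform(0, 10, size=200))
resid = rng.normal(scale=0.2 + 0.3 * fitted)  # std dev increases with fitted

# Goldfeld-Quandt-style check: compare residual variance in the low vs. high
# halves of the fitted values; a ratio well above 1 suggests heteroscedasticity.
low, high = resid[:100], resid[100:]
ratio = high.var() / low.var()
print(ratio)
```

Under homoscedasticity the ratio hovers around 1; here it comes out several times larger.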

Data Transformation

Here’s a pair plot of my untransformed data set with a few select problem features:

In this case pledged was my dependent y, and num_gift_options and photo_cnt were two selected features.

While not a guarantee, it’s sometimes the case that transforming features or the target to a ‘more normal looking’ distribution can help with the problematic OLS assumptions mentioned above.

In this case, the pledged amount y is begging for a transformation to log space.

Its individual y values range anywhere from $2 to $80,000.

In my case pledged was in a Pandas DataFrame, so I converted the entire column via numpy’s log function:

Transform Pandas DataFrame column to log values

This resulted in the following transformation:

One quick aside: when you transform y to log space, you’ll implicitly end up interpreting unit changes in X as percentage changes in the original, non-log y.
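The conversion described above amounts to something like this (toy values; only the column name pledged comes from the text):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for the project data (made-up pledge amounts).
df = pd.DataFrame({"pledged": [2.0, 150.0, 4000.0, 80000.0]})

# Replace the column in place with its natural log.
df["pledged"] = np.log(df["pledged"])

print(df["pledged"].round(2).tolist())  # [0.69, 5.01, 8.29, 11.29]
```

The four-orders-of-magnitude spread collapses into a much more compact, symmetric range.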

The answer in this StackOverflow thread has a very clear explanation of why this is the case by using the property of the natural log’s derivative.

Now on to the features.

I’ve found the Box-Cox transformation to help immensely with regard to fixing residual normality.

If you look at the center box in the pair plot above, you’ll see the un-transformed distribution of the number of gift options.

Here’s how to run a Box-Cox transformation of that feature using scipy.stats:

Box-Cox Transformation with best lambda parameter

Note here that the stats.boxcox_normmax function from scipy.stats will find the best lambda to use in the power transformation.

Here’s how it looks post-transformation:

If the feature in question has zero or negative values, neither the log transform nor the Box-Cox transform will work.

Thankfully, the Yeo-Johnson power transformation solves for this case explicitly.

Conveniently, the Yeo-Johnson method is the default in sklearn.preprocessing’s PowerTransformer:

Here’s what that looks like post-transformation:

While the transformed features are by no means normally distributed themselves, look at what we get for our residual distribution and variance plots post-transformation:

This is night and day from where we started, and we can now say that we have essentially normally distributed residuals and constant variance among the residuals.

Hence these OLS assumptions now hold, and we can be more confident in having a BLUE estimator.

Model Testing and Interpretation

This is by no means the end point of the analysis.

In this specific case, I ended up running a 3-fold Cross-Validation testing out Linear Regression, Ridge Regression, and Huber Regression on a validation split of my training data, and then finally testing the winner on the held-out test data to see if the model generalized.
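That bake-off can be sketched with sklearn’s cross_val_score (synthetic data; my actual features and fold scores aren’t reproduced here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, HuberRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the crowdfunding features.
X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "huber": HuberRegressor(),
}

# 3-fold cross-validation on R^2 (the default scorer for regressors),
# mirroring the model comparison described above.
scores = {name: cross_val_score(m, X, y, cv=3).mean() for name, m in models.items()}
print(scores)
```

The winner on the validation folds is then refit and scored once on the held-out test set to check generalization.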

The overall point is that it’s best to make sure you have met the OLS assumptions before going into a full train/validation/test loop on a number of models for the regression case.

One note is that when you transform a feature, you lose the ability to directly interpret its coefficient’s effect on y at the end.

For example, I did not transform the project length feature in this analysis, and at the end I was able to say that a unit increase (+1 day) in project length led to an 11% decrease in funding amount.

Since I used these transformations on the photo count and number of gift options features, I can’t make the same assertion given a unit increase in X, as the coefficient predictions are relative to the transformation.

Thus transformations do have a downside, but it’s worth it to know you’re getting a BLUE estimator via OLS.
