We should not be able to use a linear model to accurately predict one feature from another.
Let’s take X1 and X2 as examples of features.
It could happen that X1 = 2 + 3 * X2, which violates the assumption.
One scenario to watch out for is the ‘dummy variable trap’ — when we use dummy variables to encode a categorical feature and do not omit the baseline level from the model.
This results in perfect correlation between the dummy variables and the constant term.
Multicollinearity can be present in the model, as long as it is not ‘perfect’.
When multicollinearity is present but not perfect, the estimates are less efficient, but still unbiased.
The estimates will be less precise and highly sensitive to particular sets of data.
We can detect multicollinearity using variance inflation factor (VIF).
Without going into too much detail, the interpretation of VIF is as follows: the square root of a given variable’s VIF shows how much larger the standard error is, compared with what it would be if that predictor were uncorrelated with the other features in the model.
If no features are correlated, then all values for VIF will be 1.
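As a rough Python counterpart of the R snippet below, here is a minimal sketch using statsmodels; it assumes X is a pandas DataFrame holding the features (the name X and the added constant column are details of this sketch, not necessarily of the original code):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# add an intercept column, since VIF should be computed on a design matrix with a constant
X_const = add_constant(X)

# VIF for each feature (skipping the constant itself)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X_const.columns[1:]
)
print(vif)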
%%R
library(car)
vif(lin_reg)

To deal with multicollinearity, we should iteratively remove features with high values of VIF.
A rule of thumb for removal could be VIF larger than 10 (5 is also common).
Another possible solution is to use PCA to reduce features to a smaller set of uncorrelated components.
Tip: we can also look at the correlation matrix of the features to identify dependencies between them.
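For instance, assuming X is a pandas DataFrame holding the features:

# pairwise correlations between features; values close to +/-1 suggest collinearity
print(X.corr())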
Homoscedasticity (equal variance) of residuals

When residuals do not have constant variance (they exhibit heteroscedasticity), it is difficult to determine the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide/narrow.
For example, if the variance of the residuals is increasing over time, confidence intervals for out-of-sample predictions will be unrealistically narrow.
Another effect of heteroscedasticity is that too much weight may be given to a subset of the data when estimating coefficients, namely the subset in which the error variance is largest.
To investigate if the residuals are homoscedastic, we can look at a plot of residuals (or standardized residuals) vs. predicted (fitted) values.
What should alarm us is the case when the residuals grow either as a function of predicted value or time (in case of time series).
We can also use two statistical tests: Breusch-Pagan and Goldfeld-Quandt.
In both of them the null hypothesis assumes homoscedasticity and a p-value below a certain level (like 0.05) indicates we should reject the null in favor of heteroscedasticity.
In the snippets below I plot residuals (and standardized ones) vs. fitted values and carry out the two mentioned tests.
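As a rough illustration of how the two tests can be run in Python with statsmodels (a sketch only; it assumes lin_reg is the fitted OLS results object, a name borrowed from the R snippet earlier):

from statsmodels.stats.diagnostic import het_breuschpagan, het_goldfeldquandt

resid = lin_reg.resid          # residuals of the fitted OLS model
exog = lin_reg.model.exog      # design matrix (including the constant)

# Breusch-Pagan: regresses the squared residuals on the explanatory variables
bp_lm, bp_lm_pvalue, bp_fvalue, bp_f_pvalue = het_breuschpagan(resid, exog)
print("Breusch-Pagan p-value:", bp_lm_pvalue)

# Goldfeld-Quandt: compares the residual variance of two sub-samples
gq_fvalue, gq_pvalue, _ = het_goldfeldquandt(lin_reg.model.endog, exog)
print("Goldfeld-Quandt p-value:", gq_pvalue)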
In the plots, homoscedasticity shows up as a random placement of the points with no visible pattern (no increase/decrease in the values of the residuals); the red line in the R plots should be flat.
We can see that this is not the case for our dataset.
The results indicate that the assumption is not satisfied and we should reject the hypothesis of homoscedasticity.
Potential solutions:
- log transformation of the dependent variable
- in case of time series, deflating a series if it concerns monetary value
- using ARCH (auto-regressive conditional heteroscedasticity) models to model the error variance. An example might be the stock market, where data can exhibit periods of increased or decreased volatility over time (volatility clustering, see this article for more information)

No autocorrelation of residuals

This assumption is especially dangerous in time-series models, where serial correlation in the residuals implies that there is room for improvement in the model.
Extreme serial correlation is often a sign of a badly mis-specified model.
Other reasons for serial correlation in the residuals could be a violation of the linearity assumption or bias that is explainable by omitted variables (interaction terms or dummy variables for identifiable conditions).
An example of the former case might be fitting a (straight) line to data that exhibits exponential growth over time.
This assumption also has meaning in case of non-time-series models.
If residuals always have the same sign under particular conditions, it means that the model systematically underpredicts/overpredicts what happens when the predictors have a particular configuration.
To investigate if autocorrelation is present, I use ACF (autocorrelation function) plots and the Durbin-Watson test. In the former case, we want to see if the value of the ACF is significant for any lag (for non-time-series data, the row number is used).
While calling the function, we indicate the significance level (see this article for more details) we are interested in and the critical area is plotted on the graph.
Significant correlations lie outside of that area.
Note: when dealing with data without the time dimension, we can alternatively plot the residuals vs. the row number.
In such cases rows should be sorted in a way that (only) depends on the values of the feature(s).
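A minimal sketch of the ACF plot in Python, again assuming lin_reg is the fitted statsmodels OLS results object (the significance level is passed via the alpha argument):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# correlations falling outside the shaded band are significant at the chosen level
plot_acf(lin_reg.resid, alpha=0.05)
plt.show()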
The second approach is using the Durbin-Watson test.
I will not go into the details of how it is constructed, but will provide a high-level overview.
The test statistic provides a test for significant residual autocorrelation at lag 1.
The DW statistic is approximately equal to 2(1-a), where a is the lag 1 residual autocorrelation.
The DW test statistic is located in the default summary output of statsmodels’s regression.
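It can also be computed directly from the residuals; a short sketch, under the same lin_reg assumption as above:

from statsmodels.stats.stattools import durbin_watson

# values near 2 indicate no first-order autocorrelation in the residuals
print("Durbin-Watson statistic:", durbin_watson(lin_reg.resid))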
Some notes on the Durbin-Watson test:
- the test statistic always has a value between 0 and 4
- a value of 2 means that there is no autocorrelation in the sample
- values < 2 indicate positive autocorrelation, values > 2 negative autocorrelation.
Potential solutions:
- in case of minor positive autocorrelation, there might be some room for fine-tuning the model, for example, adding lags of the dependent/independent variables
- some seasonal components might not be captured by the model; account for them using dummy variables or seasonally adjust the variables
- if DW < 1 it might indicate a possible problem in model specification; consider stationarizing time-series variables by differencing, logging, and/or deflating (in case of monetary values)
- in case of significant negative correlation, some of the variables might have been overdifferenced
- use Generalized Least Squares
- include a linear (trend) term in case of a consistent increasing/decreasing pattern in the residuals

4. Other assumptions

Below I present some of the other commonly verified assumptions of linear regression.
The features and residuals are uncorrelated

To investigate this assumption, I check the Pearson correlation coefficient between each feature and the residuals, and then report the p-value for testing the lack of correlation between the two series.
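A sketch of this check in Python, assuming X is the feature DataFrame and lin_reg the fitted OLS results object:

from scipy.stats import pearsonr

# H0 of the test: the feature and the residuals are uncorrelated
for column in X.columns:
    corr, p_value = pearsonr(X[column], lin_reg.resid)
    print(f"{column}: correlation = {corr:.4f}, p-value = {p_value:.4f}")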
I cannot reject the null hypothesis (lack of correlation) for any pair.
The number of observations must be greater than the number of features

This one is pretty straightforward.
We can check the shape of our data by using the shape attribute in Python or the dim function in R.
Also, a rule of thumb says that we should have more than 30 observations in the dataset. This comes from the Central Limit Theorem, which states that the sum (or average) of IID random variables approaches a normal distribution as the sample size grows, even when the variables themselves are not Gaussian; 30 is the commonly used threshold for a "large enough" sample.
There must be some variability in features

This assumption states that there must be some variance in the features, as a feature that has a constant value for all or the majority of observations might not be a good predictor.
We can check this assumption by simply checking the variance of all features.
X.apply(np.var, axis=0)

In the caret package in R there is a function called nearZeroVar for identifying features with zero or near-zero variance.

%%R
library(caret)
apply(X, 2, var)
nearZeroVar(X, saveMetrics = TRUE)

Normality of residuals

When this assumption is violated, it causes problems with calculating confidence intervals and various significance tests for coefficients.
When the error distribution significantly departs from Gaussian, confidence intervals may be too wide or too narrow.
Some of the potential reasons causing non-normal residuals:
- presence of a few large outliers in the data
- there might be some other problems (violations) with the model assumptions
- a different model specification might be better suited for this problem

Technically, we can omit this assumption if we assume instead that the model equation is correct and our goal is to estimate the coefficients and generate predictions (in the sense of minimizing mean squared error).
However, normally we are interested in making valid inferences from the model or estimating the probability that a given prediction error will exceed some threshold in a particular direction.
To do so, the assumption about normality of residuals must be satisfied.
To investigate this assumption we can:
- look at QQ plots of the residuals (a detailed description can be found here). For example, a bow-shaped pattern of deviations from the diagonal implies that the residuals have excessive skewness (i.e., the distribution is not symmetrical, with too many large residuals in one direction), while an S-shaped pattern of deviations implies excessive kurtosis of the residuals, with either too many or too few large errors in both directions
- use statistical tests such as the Kolmogorov-Smirnov test, the Shapiro-Wilk test, the Jarque-Bera test and the Anderson-Darling test

From the results above we can infer that the residuals do not follow a Gaussian distribution — from the shape of the QQ plot, as well as from rejecting the null hypothesis in all of the statistical tests.
The reason why Kolmogorov-Smirnov from ols_test_normality shows different results is that it does not run the `two-sided` version of the test.
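For reference, a minimal Python sketch of such checks (a sketch only, assuming lin_reg is the fitted OLS results object; the exact set of tests in the original code may differ):

import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

resid = lin_reg.resid

# QQ plot of the residuals against the theoretical normal quantiles
sm.qqplot(resid, line="45", fit=True)
plt.show()

# a few normality tests; small p-values suggest the residuals are not Gaussian
print("Shapiro-Wilk:", stats.shapiro(resid))
print("Jarque-Bera:", stats.jarque_bera(resid))
print("Anderson-Darling:", stats.anderson(resid, dist="norm"))

# Kolmogorov-Smirnov against a standard normal, so standardize the residuals first
resid_std = (resid - resid.mean()) / resid.std()
print("Kolmogorov-Smirnov:", stats.kstest(resid_std, "norm"))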
Potential solutions:
- nonlinear transformation of the target variable or features
- remove/treat potential outliers
- it can happen that there are two or more subsets of the data having different statistical properties, in which case separate models might be considered

5. Bonus: Outliers

This is not really an assumption; however, the existence of outliers in our data can lead to violations of some of the above-mentioned assumptions.
That is why we should investigate the data and verify if some extreme observations are valid and important for our research or merely some errors which we can remove.
I will not dive deep into outlier detection methods as there are already many articles about them.
A few potential approaches:
- Z-score
- box plot
- Leverage — a measure of how far away the feature values of a point are from those of the other observations. A high-leverage point is a point at extreme values of the variables, where the lack of nearby observations makes the fitted regression model pass close to that particular point.
- Cook’s distance — a measure of how deleting an observation impacts the regression model. It makes sense to investigate points with high Cook’s distances.
- Isolation Forest — for more details see this article
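As an illustration of the leverage and Cook’s distance measures listed above, a sketch using statsmodels influence diagnostics (assuming, as before, that lin_reg is the fitted OLS results object):

# influence diagnostics for each observation of the fitted OLS model
influence = lin_reg.get_influence()

# leverage: diagonal of the hat matrix
leverage = influence.hat_matrix_diag

# Cook's distance for each observation (with associated p-values)
cooks_d, cooks_p = influence.cooks_distance

# 4/n is a common informal threshold for flagging influential points, not a strict cutoff
threshold = 4 / len(cooks_d)
print("Observations with high Cook's distance:", (cooks_d > threshold).sum())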
6. Conclusions

In this article I showed what the assumptions of linear regression are, how to verify whether they are satisfied, and what steps we can take to fix the underlying issues with the model.
The code used for this article can be found on GitHub.
References

http://people.duke.edu/~rnau/testing.htm