Photo by John Moeses Bauan on UnsplashAll the algebra you need to know about Linear Regression to be interview-readyNathan ToubianaBlockedUnblockFollowFollowingMay 23Linear Regression (LR) is one of the most simple and important algorithm there is in data science.
Whether you’re interviewing for a job in data science, data analytics, machine learning or quant research, you might end up having to answer specific algebra questions about LR.
Here’s all you need to know to feel confident about your LR knowledge.
Note: this article is all about theory, no coding.
It also assumes that you are already at least a little familiar with algebra.
Know your assumptionsAlthough this article assumes that you’re already somewhat familiar with LR, let’s remind the formula and assumptions.
Given N observations, an output vector Y (dimension Nx1) and p inputs X1, X2, …, Xp (each input vector being of dimension Nx1), LR assumes that the regression function E(Y|X) is linear in the inputs.
Y thus statisfies:where epsilon is the error terms.
The linear assumption is actually the only assumption strictly necessary for LR later in this article we’ll se that we can add more assumptions to infer more results.
Although this above formula may seem simple, it’s not always easy to find the coefficients (beta) — moving forward, we’ll call hat betas our estimates of the coefficients.
Know the key metrics definitionsHere are the 3 metrics you definitely need to know (by heart):The RSS is the Residual Sum of Squareswhere y_i is the output for observation i and ŷ_i is the estimated output for observation i:TSS is the Total Sum of SquareswhereR-squared, which is a measure of the linear relationship betwween X and YAs long as you know these formulas you should be able to infer all the other results by logical reasoning.
Know the solution: how (and when) to compute itAs we said before, the key of LR is finding estimations of the coefficents.
We find these by minimizing the RSS.
To do so, let’s define X and Y byNote that we have to add a column of 1s to the input matrix in order to take into account the intercept beta_0.
Our minimization problem is equivalent to solving:So we can compute the gradient of the right term:Which is supposed to be equal to 0 for our estimation beta hat.
Assuming that X⊤X is non singular, this gives us an explicit solution:This is a formula you need to know, but you should also be able to prove it, as done above.
The assumption of non singularity is key here.
We also infer the formula of y estimated:It can also be useful to know the explicit solution in dimension Nx1 (1 input variable):Where x is here a vector (not a matrix) — it’s easy to remember when you already have the more general solution: X⊤X becomes the variance of the input (in 1-dimension, inverting a term is equivalent to dividing by that term) and X⊤y becomes the covariance term.
You can also compute this solution by doing a similar calculus, in dimension 1.
Be comfortable with hypothesis testing (assuming normal errors)In an interview, it’s also important to have some statistical notions for LR.
This section assumes that you have the basics in statistical tests (including t stats, f-stats and hypothesis testing).
Assuming normal errors, i.
e(let’s not forget that epsilon is a vector here!) then our estimated solution satisfies:and thuswhich brings us toThis finding helps evaluating the null hypothesis that a coefficient beta_j is null: we can compute the t-scoreUnder the null hypothesis, t_j follows a t-distribution with N-p-1 degrees of freedom, and for N large enough, follows a normal distribution.
Computing this score can help you assess the null hypothesis.
For example, finding a |t_j| score higher than 1.
96 will ensure significance at the 5% level that the coefficient beta_j is not null.
You can also compute confidence intervals for a given coefficient: an approximate (1–2*alpha)-confidence interval is given bywhere we use the areas under the standard normal curve to compute:We can also test the hypothesis that every coeffcicient is null by computing the F-statwhich under the null hypothesis, follows an F(p, N-p-1) distribution.
Hence, large values of F constitue evidence against the null hypothesis.
Bonus 1: why do we do least squares?When we looked for the optimal estimation of beta, we instinctively jumped to least squares optimization, but why?.First, we can prove (we won’t here, check this article for more details) that the least squares estimator is unbiased, i.
eNext, there’s actually a theorem that proves that the least squares estimator has the smallest variance.
This is the Gauss-Markov theorem: it shows that the best unbiased estimator is the least squares.
Bonus 2: what to do when X⊤X is not fully ranked?First, let’s see when this happens.
Let’s remind ourselves that X is of dimension (N, p+1).
X⊤X is then (p+1, p+1).
We can show that X⊤X is fully ranked if and only if X is of rank p+1, which forces N>p.
You can see the proof here.
This means that the features are linearly independent (but not necessarily uncorrelated).
When this is not the case (when we have more features than observations), we can use shrinkage methods such as ridge regression.
Indeed, when we add a term to the diagonal of X⊤X, the problem becomes feasible.
For the example of ridge regression:has a solution:indeed, because X⊤X, is positive semi-definite, its eigenvalues are all positive, adding a positive term to the diagonal makes it fully ranked.
So this makes the problem non-singular.
This is one the reasons we have to apply regularization techniques for cases when p>>N.
.. More details