# Understanding Correlations with Python

Correlation is not causation. It is worth stating this up front, because some of the biggest blunders in analysis result from forgetting it.

The fact that two variables are correlated does not mean that one causes the changes in the other.

## Informative Variables for Predictions

Now, back from the hazards of "correlation does not mean causation", we have some good news.

Even if there is no causal relationship between two variables, once they are correlated, one is often a good predictor of the other.

This means that finding variables that are correlated with your target variable is a quick way of identifying informative/important variables for your machine learning prediction task.
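As a minimal sketch of this idea (using synthetic data and made-up column names), we can rank candidate features by the absolute value of their correlation with the target:

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends strongly on feature_a,
# weakly on feature_b, and not at all on noise.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["target"] = 2.0 * df["feature_a"] + 0.3 * df["feature_b"] + rng.normal(size=n)

# Correlation of every column with the target, sorted by strength.
ranking = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

With this setup, `feature_a` comes out on top and `noise` at the bottom, which is exactly the quick screening behaviour described above.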

## What Are the Drivers of Correlation?

Understanding these helps put the discussion above about correlation and causal relationships in perspective. Two variables can be correlated because:

- one variable (X) is directly responsible for the changes in the other (Y): this is causality;
- both variables, say X and Y, are responding to a common variable, say Z. So even though X and Y are correlated, the actual causal relationships are Z -> X and Z -> Y;
- or there is some loose, coincidental association between the variables.
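The second case (a common driver Z) is easy to demonstrate with a toy simulation; the coefficients here are arbitrary and purely illustrative:

```python
import numpy as np

# Z drives both X and Y, so X and Y end up correlated
# even though neither causes the other.
rng = np.random.default_rng(42)
z = rng.normal(size=2000)               # the hidden common driver
x = 1.5 * z + rng.normal(size=2000)     # Z -> X
y = -2.0 * z + rng.normal(size=2000)    # Z -> Y

r_xy = np.corrcoef(x, y)[0, 1]
print(f"corr(X, Y) = {r_xy:.2f}")       # strongly negative, yet X does not cause Y
```

If you only looked at `corr(X, Y)` here, you might wrongly conclude that X influences Y, when in fact both are just echoing Z.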

## The Maths & Properties of Correlation

Correlation is the covariance of the two variables normalized by the product of their standard deviations.

The covariance is just a fancy way of indicating the strength of the linear relationship between two variables (e.g. X and Y).

So if the value of Y increases as X increases, you have a larger positive covariance, and if Y decreases as X increases, you have a negative covariance (see charts above).

However, covariance is NOT scale-free: its magnitude depends on the magnitudes of the two variables involved, so a higher covariance does not necessarily mean a stronger linear relationship.
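The scale dependence is easy to see by rescaling one variable (think metres to centimetres); this sketch uses synthetic data:

```python
import numpy as np

# Rescaling X by 100 multiplies the covariance by 100,
# but leaves the correlation unchanged.
rng = np.random.default_rng(7)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)

x_cm = 100 * x  # same quantity, different units
print(np.cov(x, y)[0, 1], np.cov(x_cm, y)[0, 1])            # covariance blows up 100x
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x_cm, y)[0, 1])  # correlation unchanged
```

This is why a larger covariance, on its own, tells you nothing about the strength of the relationship.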

Correlation rescues this, being essentially the normalised/standardised version of covariance, which enables us to examine the strength of the linear relationship on a scale of [-1, 1], independent of the magnitudes of the variables themselves.

Precisely:

corr(X, Y) = cov(X, Y) / (σ_X · σ_Y)

where X and Y are the two variables and σ_X and σ_Y are their standard deviations.

In expanded form:

corr(X, Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X · σ_Y)

where E[·] is the expected value (mean) of the product of the deviations from the respective means μ_X and μ_Y.

### Interesting Properties

- corr(X, Y) = corr(Y, X)
- If the variables X and Y are independent, corr(X, Y) ≈ 0, meaning they are uncorrelated, i.e. there is NO linear relationship between them.
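The expanded formula can be computed by hand and checked against NumPy; a quick sketch with synthetic data:

```python
import numpy as np

# corr(X, Y) = E[(X - mu_X)(Y - mu_Y)] / (sigma_X * sigma_Y), done manually.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 0.5 * x + rng.normal(size=5000)

mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()
r_manual = np.mean((x - mu_x) * (y - mu_y)) / (sigma_x * sigma_y)

r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # the two agree

# Symmetry property: corr(X, Y) == corr(Y, X)
print(np.corrcoef(y, x)[0, 1])
```

Note that the sample-size correction (ddof) cancels between the numerator and denominator, which is why the manual population-style computation matches `np.corrcoef` exactly.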

It is important to note that corr(X, Y) = 0 does NOT mean X and Y are independent: they could be related via a non-linear relationship.
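A classic sketch of this: Y = X² with X symmetric around zero. Y is completely determined by X, yet their correlation is approximately zero:

```python
import numpy as np

# Perfect (non-linear) dependence, near-zero correlation.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])  # close to 0 despite Y being a function of X
```

The positive and negative halves of the parabola cancel out, so no *linear* trend survives, even though the dependence could not be stronger.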

Correlation, just like any other summary statistic (e.g. the mean), is only an indication of a linear relationship, and you should inspect the data to confirm the kind of relationship.

Anscombe's quartet provides an illustration: four distributions with the same mean, variance, and correlation that nevertheless represent very different patterns.
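To make this concrete, here are two of Anscombe's four datasets (values from the original 1973 paper); their shapes differ (linear-with-noise vs curved), but their summary statistics are nearly identical:

```python
import numpy as np

# Anscombe datasets I and II share the same x values.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    r = np.corrcoef(x, y)[0, 1]
    print(f"mean={y.mean():.2f} var={y.var(ddof=1):.2f} corr={r:.3f}")
# Both report roughly the same mean, variance, and corr of about 0.816 --
# plot them to see how different they actually are.
```

The numbers alone cannot distinguish the two; only inspecting (plotting) the data reveals the curved pattern in the second set.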

*Anscombe's quartet (source)*

- The covariance of a random variable X and a constant c is zero: cov(X, c) = 0.
- corr(X, X) = 1 (why? because cov(X, X) = σ_X², so the normalisation gives exactly 1).
- In some cases the correlation cannot be computed, e.g. corr(Y, c) where c is a constant, as σ_c = 0 and the formula involves division by zero.

There is also a close relationship between correlation, covariance, variance, and standard deviation; more details in the linked resource.
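The division-by-zero case is easy to reproduce: NumPy signals the undefined correlation with a constant by returning NaN.

```python
import numpy as np

# corr(Y, c) for a constant c is undefined: sigma_c = 0,
# so the formula divides by zero. numpy returns NaN.
y = np.array([1.0, 2.0, 3.0, 4.0])
c = np.full_like(y, 5.0)  # a constant "variable"

with np.errstate(invalid="ignore", divide="ignore"):
    r = np.corrcoef(y, c)[0, 1]
print(r)  # nan
```

In practice this means a feature with zero variance carries no correlation signal at all, and such columns are usually dropped before any correlation analysis.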