Feature Engineering in Python: Variable Distribution

In variables with a normal distribution, the observations of X available to predict Y vary across a greater range of values; that is, the values of X are “spread” over a greater range.

See Figure 1 below.

```python
# simulation of the Gaussian distribution of a variable X
# x-axis indicates the values of X
# y-axis indicates the frequency of each value
# x is spread over a big range (-3, 3)
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm  # matplotlib.mlab.normpdf was removed in recent matplotlib versions

mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, norm.pdf(x, mu, sigma))
plt.show()
```

In the Gaussian distribution depicted in the figure above, the values of the individual observations vary greatly across a wide range of x values (in this case -2 to 2).

In variables with skewed distributions, the majority of the observations available to predict Y vary within a very narrow value range of X, and very few observations are available in the tails of the distribution.

See Figure 2 below.

```python
# note: scipy.linspace and pylab imports replaced with their numpy/matplotlib equivalents
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf

def pdf(x):
    return 1 / np.sqrt(2 * np.pi) * np.exp(-x**2 / 2)

def cdf(x):
    return (1 + erf(x / np.sqrt(2))) / 2

def skew(x, e=0, w=1, a=0):
    t = (x - e) / w
    return 2 / w * pdf(t) * cdf(a * t)

n = 2**10
e = 1.0  # location
w = 2.0  # scale
x = np.linspace(-10, 10, n)

# This simulation shows a variable X skewed to the left.
# The majority of the values of X accumulate on the right-hand side of the plot,
# with only a few observations in the left tail.
# This means that we do not have enough values of X on the left for our prediction
# model to learn from: there are a lot of observations in the value range (0, 5),
# but very few in the range (-10, 0).
p = skew(x, e, w, 5)
plt.plot(x, p)
plt.show()
```

In the skewed distribution above, we see that the majority of the observations take values over a very narrow value range (2–2.5 in this example), with few observations taking higher values (> 5 in this example). Therefore, regardless of the outcome, most observations will have values in the 2–2.5 region, making discrimination and outcome prediction difficult.
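Asymmetry can also be checked numerically rather than visually. As a quick sketch (my own illustration on simulated data, using `scipy.stats.skew`, which is not part of the original notebook), the skewness coefficient is close to 0 for a symmetric distribution and clearly non-zero for a skewed one:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(42)
symmetric = rng.normal(size=10000)          # Gaussian sample: skewness close to 0
right_skewed = rng.exponential(size=10000)  # exponential sample: strong positive skew

print(stats.skew(symmetric))      # close to 0
print(stats.skew(right_skewed))   # clearly positive (theoretically 2 for an exponential)
```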

```python
# overlay of a normal distribution (yellow) with 2 skewed distributions (green and blue)
for a in [-5, 0, 5]:
    p = skew(x, e, w, a)
    plt.plot(x, p)
plt.show()
```

In the figure above, we can see more clearly how, in a variable with a Gaussian distribution, the observations can take values across a wider range than in a skewed variable, where the majority of values are concentrated at one end.

What can we do if variables are skewed?

If the performance of the machine learning model is poor due to the skewed distribution of the variables, there are two strategies available to improve performance:

1. Find a transformation of the variable X that stabilizes the variance and generates more consistent support across the range of values (a transformation that gives the variable more of the bell shape of the Gaussian distribution).
2. Choose an appropriate binning transformation (discretization) in order to enable each portion of the predictors’ ranges to be weighted appropriately.
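The second strategy can be sketched with pandas’ `qcut`, which performs equal-frequency binning: each interval captures roughly the same number of observations, so the sparse tails are merged into wider bins. This is a minimal illustration on simulated data, not part of the original notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = pd.Series(rng.exponential(size=1000))  # a right-skewed variable

# equal-frequency binning into 5 bins: each bin receives ~the same number of
# observations, giving the model comparable support across the predictor's range
x_binned = pd.qcut(x, q=5, labels=False)
print(x_binned.value_counts().sort_index())  # roughly 200 observations per bin
```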

In this notebook, I will discuss some diagnostic methods to determine whether the variables have an approximately normal distribution, and run some experiments to see how normality affects the performance of machine learning algorithms.

Real Life Examples

Predicting Survival on the Titanic: understanding social behavior and beliefs

Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board.

Interestingly, by analysing the probability of survival based on a few attributes, such as gender, age, and social status, we can make very accurate predictions about which passengers would survive.

Some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Therefore, we can learn about the priorities and privileges of the society at the time.

Predicting the Sale Price of Houses

The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential houses. Predicting house prices is useful for identifying fruitful investments, or for determining whether the price advertised for a house is over- or under-estimated, before making a buying decision.

Titanic dataset

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
import scipy.stats as stats

# for regression problems
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# for classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# to split and standardize the datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# to evaluate regression models
from sklearn.metrics import mean_squared_error

# to evaluate classification models
from sklearn.metrics import roc_auc_score

# load the numerical variables of the Titanic dataset
data = pd.read_csv('titanic.csv', usecols=['Pclass', 'Age', 'Fare', 'Survived'])
data.head()
```

Age

```python
# plot the histogram to have a quick look at the distribution
data.Age.hist()
```

One way of assessing whether the distribution is approximately normal is to evaluate the Quantile-Quantile plot (Q-Q plot).

In a Q-Q plot, the quantiles of the variable are plotted on the vertical axis, and the quantiles of a specified probability distribution (here, the Gaussian distribution) are on the horizontal axis. The plot consists of a series of points that show the relationship between the real data and the specified probability distribution. If the values of the variable match the specified probability distribution exactly, the points on the graph will form a 45-degree line.
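To see what a “perfect” Q-Q plot looks like, we can feed `probplot` data that is Gaussian by construction. This is a quick sketch on simulated data; the variable name and seed are my own:

```python
import numpy as np
import scipy.stats as stats
import pylab

rng = np.random.RandomState(29)
gaussian_sample = rng.normal(loc=0, scale=1, size=1000)

# because the sample is drawn from a Gaussian, the points fall almost
# exactly on the 45-degree line
stats.probplot(gaussian_sample, dist="norm", plot=pylab)
pylab.show()
```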

```python
# let's plot the Q-Q plot for the variable Age
temp = data.dropna(subset=['Age'])
stats.probplot(temp.Age, dist="norm", plot=pylab)
pylab.show()
```

The majority of the observations lie on the 45-degree red line, following the expected quantiles of the theoretical Gaussian distribution.

Some observations at the lower end of the value range depart from the red line, and this is consistent with the slight shift towards the left in the variable distribution observed in the histogram above.

```python
# let's apply a transformation and see what it does to the distribution
(data.Age**(1/1.5)).hist()

# and now the effect of the transformation on the Q-Q plot
stats.probplot(temp.Age**(1/1.5), dist="norm", plot=pylab)
pylab.show()
```

The variable transformation did not result in a Gaussian distribution of the transformed Age values.
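Rather than guessing exponents such as 1/1.5 by hand, one can let the data choose: `scipy.stats.boxcox` estimates the power parameter that best normalises a strictly positive variable. Below is a sketch on a simulated, Age-like skewed variable; the data and names are my own illustration, not the Titanic column:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(7)
# hypothetical strictly positive, right-skewed stand-in for Age
age_like = rng.exponential(scale=30, size=1000) + 1

# boxcox returns the transformed values and the lambda that maximises normality
transformed, fitted_lambda = stats.boxcox(age_like)

print(stats.skew(age_like))     # strongly positive before
print(stats.skew(transformed))  # much closer to 0 after
```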

Fare

```python
# let's have a look at the Fare variable
data.Fare.hist()

# and the Q-Q plot
stats.probplot(data.Fare, dist="norm", plot=pylab)
pylab.show()
```

Both the histogram and the Q-Q plot make it clear that Fare does not follow a Gaussian distribution.

```python
# and now let's apply a transformation
(data.Fare**(1/4)).hist()

# and the Q-Q plot
stats.probplot(data.Fare**(1/4), dist="norm", plot=pylab)
pylab.show()
```

Model performance with original and transformed variables

```python
# let's add the transformed variables to the dataset
data['Fare_transformed'] = data.Fare**(1/4)
data['Age_transformed'] = data.Age**(1/1.5)

# let's separate into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.fillna(0), data.Survived, test_size=0.3, random_state=0)
X_train.shape, X_test.shape
```

((623, 6), (268, 6))

```python
# let's scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Logistic Regression

```python
# model built using the natural distributions
logit = LogisticRegression(random_state=44, C=1000)  # C big to avoid regularization
logit.fit(X_train[['Age', 'Fare']], y_train)

print('Train set')
pred = logit.predict_proba(X_train[['Age', 'Fare']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = logit.predict_proba(X_test[['Age', 'Fare']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
```

Train set
Logistic Regression roc-auc: 0.6863462831608859
Test set
Logistic Regression roc-auc: 0.7137499999999999

```python
# model built using the transformed variables
logit = LogisticRegression(random_state=44, C=1000)  # C big to avoid regularization
logit.fit(X_train[['Age_transformed', 'Fare_transformed']], y_train)

print('Train set')
pred = logit.predict_proba(X_train[['Age_transformed', 'Fare_transformed']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = logit.predict_proba(X_test[['Age_transformed', 'Fare_transformed']])
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
```

Train set
Logistic Regression roc-auc: 0.686790958981367
Test set
Logistic Regression roc-auc: 0.7275595238095238

Using the transformed variables improved the performance of the Logistic Regression model (compare test roc-auc: 0.7137 vs 0.7275).

Support Vector Machines

```python
# model built using the natural distributions
SVM_model = SVC(random_state=44, probability=True)
SVM_model.fit(X_train[['Age', 'Fare']], y_train)

print('Train set')
pred = SVM_model.predict_proba(X_train[['Age', 'Fare']])
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = SVM_model.predict_proba(X_test[['Age', 'Fare']])
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
```

```python
# model built on the transformed variables
SVM_model = SVC(random_state=44, probability=True)
SVM_model.fit(X_train[['Age_transformed', 'Fare_transformed']], y_train)

print('Train set')
pred = SVM_model.predict_proba(X_train[['Age_transformed', 'Fare_transformed']])
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = SVM_model.predict_proba(X_test[['Age_transformed', 'Fare_transformed']])
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
```

For SVMs, the transformation of the variables also improved the performance of the model. Not only does the SVM now generalise better to the test set, but the model built using the transformed variables also does not over-fit the train set (compare train roc-auc 0.927 vs 0.726).

Random Forests

```python
# model built using the natural distributions
rf = RandomForestClassifier(n_estimators=700, random_state=39)
rf.fit(X_train[['Age', 'Fare']], y_train)

print('Train set')
pred = rf.predict_proba(X_train[['Age', 'Fare']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = rf.predict_proba(X_test[['Age', 'Fare']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
```

```python
# model built on the transformed variables
rf = RandomForestClassifier(n_estimators=700, random_state=39)
rf.fit(X_train[['Age_transformed', 'Fare_transformed']], y_train)

print('Train set')
pred = rf.predict_proba(X_train[['Age_transformed', 'Fare_transformed']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

print('Test set')
pred = rf.predict_proba(X_test[['Age_transformed', 'Fare_transformed']])
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))
```

As expected, Random Forests did not see a benefit from transforming the variables to a more Gaussian-like distribution.
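The reason is that tree-based models split features at thresholds, and any monotonic transformation (such as the fourth root) preserves the ordering of the values, so a tree can reproduce exactly the same splits. A minimal sketch with a single decision tree on synthetic data (my illustration, not from the notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.exponential(size=(500, 1))   # a skewed feature
y = (X[:, 0] > 1.0).astype(int)      # target defined by a simple threshold

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_trf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X**(1/4), y)

# the fourth root is monotonic, so both trees find equivalent splits
# and make identical predictions
print((tree_raw.predict(X) == tree_trf.predict(X**(1/4))).all())  # True
```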

House Sale Dataset

```python
# let's load the House Sale Price dataset with a few columns
cols_to_use = ['LotArea', 'BsmtFinSF1', 'GrLivArea', 'OpenPorchSF', 'YearBuilt', 'SalePrice']
data = pd.read_csv('houseprice.csv', usecols=cols_to_use)
data.head()

# let's check for missing data
data[cols_to_use].isnull().sum()

# let's plot the histograms to get an impression of the distribution of the numerical variables
for col in cols_to_use:
    fig = data[col].hist(bins=50)
    fig.set_xlabel(col)
    fig.set_ylabel('Number of houses')
    plt.show()
```

We observed that the numerical variables are not normally distributed. In particular, most of them, apart from YearBuilt, are skewed.

```python
# let's apply a transformation and see whether the variables are now more Gaussian shaped
# on top of the histograms, we now also plot the Q-Q plots
for col in cols_to_use:
    if col not in ['SalePrice', 'YearBuilt']:
        data[col + '_transformed'] = data[col]**(1/4)
        fig = data[col + '_transformed'].hist(bins=50)
        fig.set_xlabel(col)
        fig.set_ylabel('Number of houses')
        plt.show()
        stats.probplot(data[data[col + '_transformed'] != 0][col + '_transformed'], dist="norm", plot=pylab)
        plt.show()
```

We see in the Q-Q plots that most of the observations (blue dots) lie on the 45-degree line. Therefore, the transformation was successful in giving the variables a more Gaussian-like shape.

```python
# I tried a few transformations on the variable YearBuilt without major success;
# you can go ahead and try other transformations
data['YearBuilt_transformed'] = data['YearBuilt']**(1/-2)
fig = data['YearBuilt_transformed'].hist(bins=50)
fig.set_xlabel('YearBuilt')
plt.show()

stats.probplot(data['YearBuilt_transformed'], dist="norm", plot=pylab)
pylab.show()
```

However, the transformation does not help to give the variable YearBuilt a more Gaussian-like distribution.

Machine learning model performance on original vs transformed variables

```python
# let's separate into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.fillna(0), data.SalePrice, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

X_train.head()
y_train.head()

# create a list with the untransformed columns
cols_to_use = cols_to_use[0:-1]  # drop SalePrice, the target

# create a list with the transformed columns
cols_transformed = [col + '_transformed' for col in cols_to_use]

# let's standardize the original variables
scaler = StandardScaler()
X_train_o = scaler.fit_transform(X_train[cols_to_use])
X_test_o = scaler.transform(X_test[cols_to_use])

# and the transformed variables
scaler = StandardScaler()
X_train_t = scaler.fit_transform(X_train[cols_transformed])
X_test_t = scaler.transform(X_test[cols_transformed])
```

Linear Regression

```python
# model built on the original variables
linreg = LinearRegression()
linreg.fit(X_train_o, y_train)

print('Train set')
pred = linreg.predict(X_train_o)
print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))

print('Test set')
pred = linreg.predict(X_test_o)
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))
```

```python
# model built on the transformed variables
linreg = LinearRegression()
linreg.fit(X_train_t, y_train)

print('Train set')
pred = linreg.predict(X_train_t)
print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))

print('Test set')
pred = linreg.predict(X_test_t)
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))
```

We can see that the variable transformation improved the model performance: the mse on the test set is smaller for the Linear Regression model built on the transformed variables (2.7e6 vs 2.4e6). In addition, the mse on the train set is bigger for the transformed variables, suggesting that the model built using the natural distributions is over-fitting to the train set (mse 1.6e6 vs 1.8e6).

Random Forests

```python
# model built on the original variables
rf = RandomForestRegressor(n_estimators=5, random_state=39, max_depth=2, min_samples_leaf=100)
rf.fit(X_train_o, y_train)

print('Train set')
pred = rf.predict(X_train_o)
print('Random Forests mse: {}'.format(mean_squared_error(y_train, pred)))

print('Test set')
pred = rf.predict(X_test_o)
print('Random Forests mse: {}'.format(mean_squared_error(y_test, pred)))

# model built on the transformed variables
rf = RandomForestRegressor(n_estimators=5, random_state=39, max_depth=2, min_samples_leaf=100)
rf.fit(X_train_t, y_train)

print('Train set')
pred = rf.predict(X_train_t)
print('Random Forests mse: {}'.format(mean_squared_error(y_train, pred)))

print('Test set')
pred = rf.predict(X_test_t)
print('Random Forests mse: {}'.format(mean_squared_error(y_test, pred)))
```

As expected, the Random Forests model is not affected by the variable transformation.
