Feature Engineering in Python: Outliers

An outlier is a data point that is significantly different from the remaining data.

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” [D. Hawkins, Identification of Outliers, Chapman and Hall, 1980.]

Should outliers be removed?

Depending on the context, outliers either deserve special attention or should be completely ignored.

Take the example of revenue forecasting: if unusual spikes of revenue are observed, it’s probably a good idea to pay extra attention to them and figure out what caused the spike.

In the same way, an unusual transaction on a credit card is usually a sign of fraudulent activity, which is what the credit card issuer wants to prevent.

So in instances like these, it is useful to look for outlier values and investigate them further.

If, however, outliers are due to mechanical error, measurement error, or anything else that cannot be generalised, it is a good idea to filter them out before feeding the data to the modeling algorithm.

Which machine learning models are sensitive to outliers?

Some machine learning models are more sensitive to outliers than others.

For instance, AdaBoost may treat outliers as “hard” cases and put tremendous weight on them, therefore producing a model with poor generalization.

Decision trees tend to ignore the presence of outliers when creating the branches of their trees.

Typically, trees make decisions by asking whether a variable x >= a value t, so an outlier will simply fall on one side of the split and be treated the same as the remaining values, regardless of its magnitude.

Linear models, in particular, Linear Regression, can be sensitive to outliers.

A recent research article suggests that Neural Networks could also be sensitive to outliers, provided the number of outliers is high and the deviation is also high.

I would argue that if the number of outliers is high (>15% as suggested in the article), then they are no longer outliers, but rather a fair representation of that variable.

How can outliers be identified?

Outlier analysis and anomaly detection are a huge field of research devoted to optimizing methods and creating new algorithms to reliably identify outliers. There are many methods optimized to detect outliers in different situations.

These are mostly aimed at identifying outliers when those are the observations we actually want to focus on, for example fraudulent credit card activity.

In this post, I would rather focus on identifying outliers introduced by mechanical error, so that we can process them before using them in machine learning algorithms.

Extreme Value Analysis

The most basic form of outlier detection is Extreme Value Analysis of 1-dimensional data.

The key to this method is to determine the statistical tails of the underlying distribution of the variable and then find the values that sit at the very ends of those tails.

In the typical scenario, the distribution of the variable is Gaussian and thus outliers will lie outside the mean plus or minus 3 times the standard deviation of the variable.
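As a quick illustration (not part of the original notebook), here is a minimal sketch of how these Gaussian boundaries could be computed for any numeric pandas Series and then used to filter out the values beyond them; the names series, lower, upper and df are placeholders introduced only for this example:

import pandas as pd

def gaussian_boundaries(series):
    # boundaries at the mean plus/minus 3 standard deviations,
    # assuming the variable is roughly Gaussian
    upper = series.mean() + 3 * series.std()
    lower = series.mean() - 3 * series.std()
    return lower, upper

# example usage on a dataframe df with a numeric column 'variable':
# lower, upper = gaussian_boundaries(df['variable'].dropna())
# df_inliers = df[df['variable'].between(lower, upper)]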

If the variable is not normally distributed, a general approach is to calculate the quantiles and then the interquartile range (IQR), as follows:

IQR = 75th quantile - 25th quantile

An outlier will sit outside the following upper and lower boundaries:

Upper boundary = 75th quantile + (IQR * 1.5)
Lower boundary = 25th quantile - (IQR * 1.5)

or, for extreme cases:

Upper boundary = 75th quantile + (IQR * 3)
Lower boundary = 25th quantile - (IQR * 3)

Real Life example:
Predicting Survival on the Titanic: understanding society behaviour and beliefs

Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board.

Interestingly, by analysing the probability of survival based on a few attributes like gender, age, and social status, we can make very accurate predictions about which passengers would survive.

Some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Therefore, we can learn about society's priorities and privileges at the time.

To download the Titanic data, go to the Titanic competition page on Kaggle. Click on the link 'train.csv', and then click the blue 'download' button towards the right of the screen to download the dataset. Save it in a folder of your choice. Note that you need to be logged in to Kaggle in order to download the datasets.

If you save it in the same directory from which you are running this notebook, and rename the file to 'titanic.csv', then you can load it the same way I will load it below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# to display the total number of columns present in the dataset
pd.set_option('display.max_columns', None)

# let's load the titanic dataset
data = pd.read_csv('titanic.csv')
data.head()

# quick look at the Age distribution
# (missing Age values are filled with 95, so they show up as a separate bump beyond the real ages)
sns.distplot(data.Age.fillna(95))

There are 2 numerical variables in this dataset, Fare and Age.

So let’s go ahead and find out whether they present values that we could consider outliers.

Fare

# First let's plot a histogram to get an idea of the distribution
fig = data.Fare.hist(bins=50)
fig.set_title('Fare Distribution')
fig.set_xlabel('Fare')
fig.set_ylabel('Number of Passengers')

The distribution of Fare is skewed, so in principle we shouldn't estimate outliers using the mean plus or minus 3 standard deviations method, which assumes a normal distribution of the data.
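To back the visual impression with a number, one quick check (not in the original notebook) is the sample skewness of the variable; values far from 0 indicate a skewed distribution for which the Gaussian rule is not appropriate:

# skewness of Fare: a large positive value confirms the long right tail seen in the histogram
data.Fare.skew()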

# another way of visualising outliers is using boxplots and whiskers,
# which provide the quantiles (box) and inter-quantile range (whiskers),
# with the outliers sitting outside the whiskers.
# All the dots in the plot below are outliers according to the quantiles +/- 1.5 IQR rule
fig = data.boxplot(column='Fare')
fig.set_title('')
fig.set_ylabel('Fare')

Let's look at the values of the quantiles so we can calculate the upper and lower boundaries for the outliers:

# 25%, 50% and 75% in the output below indicate the
# 25th quantile, median and 75th quantile respectively
data.Fare.describe()

Let's calculate the upper and lower boundaries to identify outliers according to the interquantile proximity rule:

IQR = data.Fare.quantile(0.75) - data.Fare.quantile(0.25)
Lower_fence = data.Fare.quantile(0.25) - (IQR * 1.5)
Upper_fence = data.Fare.quantile(0.75) + (IQR * 1.5)
Upper_fence, Lower_fence, IQR

And if we are looking at really extreme values using the interquantile proximity rule:

IQR = data.Fare.quantile(0.75) - data.Fare.quantile(0.25)
Lower_fence = data.Fare.quantile(0.25) - (IQR * 3)
Upper_fence = data.Fare.quantile(0.75) + (IQR * 3)
Upper_fence, Lower_fence, IQR

The upper boundary for extreme outliers is a Fare of roughly 100 dollars.

The lower boundary is meaningless because there can’t be a negative price for Fare.
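For reference, assuming the standard Kaggle train.csv, the 25th and 75th Fare quantiles are roughly 7.91 and 31.0, so the IQR is about 23.1. The extreme fences are therefore around 31.0 + 3 * 23.1 = 100.3 and 7.91 - 3 * 23.1 = -61.4, which is where the 100 dollar upper boundary and the meaningless negative lower boundary come from.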

Let's look at the actual number of passengers in the upper Fare ranges:

print('total passengers: {}'.format(data.shape[0]))
print('passengers that paid more than 65: {}'.format(data[data.Fare > 65].shape[0]))
print('passengers that paid more than 100: {}'.format(data[data.Fare > 100].shape[0]))

And the percentages of passengers:

# total number of passengers as a float, to compute proportions
total_passengers = float(data.shape[0])

print('total passengers: {}'.format(data.shape[0] / total_passengers))
print('passengers that paid more than 65: {}'.format(data[data.Fare > 65].shape[0] / total_passengers))
print('passengers that paid more than 100: {}'.format(data[data.Fare > 100].shape[0] / total_passengers))

When using the 3 times interquantile range interval to find outliers, we find that roughly 6% of the passengers paid extremely high fares.

We can go ahead and investigate the nature of these outliers.

Let's create a separate dataframe for the high fare payers:

high_fare_df = data[data.Fare > 100]

# Ticket: passengers who bought their fares together share the same ticket number
high_fare_df.groupby('Ticket')['Fare'].count()

A group of people who bought their tickets together, say they were a family, would have the same ticket number. And the fare attached to them is no longer the individual Fare, but rather the group Fare. This is why we see these unusually high values:

# number of passengers per ticket, next to the fare recorded for that ticket
multiple_tickets = pd.concat([
    high_fare_df.groupby('Ticket')['Fare'].count(),
    high_fare_df.groupby('Ticket')['Fare'].mean()
], axis=1)
multiple_tickets.columns = ['Passengers_per_ticket', 'Fare']
multiple_tickets.head(10)

Therefore, the fare should be divided by the number of tickets bought together to find out the individual price.

So we see how finding and investigating outliers can lead us to new insights about the dataset at hand.

Go ahead and divide the Fare by the number of tickets bought together, and then repeat the outlier-finding exercise on this newly created variable. Do you know how to do this in Python? If not, don't worry, I will show you how to calculate the individual ticket price in the final lecture of this course, in the section “Putting it all together”.
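If you want to try it right away, here is a minimal sketch of one possible approach (not necessarily the one used later in the course); the variable names people_per_ticket, fare_per_person and the column Fare_per_person are hypothetical and introduced here only for illustration:

# count how many passengers share each ticket number
people_per_ticket = data.groupby('Ticket')['Ticket'].transform('count')

# per-person fare: the group fare split across the passengers on that ticket
fare_per_person = data.Fare / people_per_ticket

# peek at the result without modifying the original dataframe
data[['Ticket', 'Fare']].assign(Fare_per_person=fare_per_person).head()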

For now, let's just go ahead and visualize a group of people who were seemingly traveling together and therefore bought their tickets together. Let's have a look at the most extreme outliers:

data[data.Fare > 300]

These three people have the same ticket number, indicating that they were traveling together. The Fare in this case, 512, is the price of 3 tickets, not one.

This is why it is unusually high.

Age

First, let's plot the histogram to get an idea of the distribution:

fig = data.Age.hist(bins=50)
fig.set_title('Age Distribution')
fig.set_xlabel('Age')
fig.set_ylabel('Number of Passengers')

Although the distribution of Age does not look strictly normal, we could assume normality and use the Gaussian approach to find outliers.

Now let's plot the boxplot and whiskers to visualize outliers:

# remember that the dots in the plot indicate outliers,
# the box the interquantile range, and the whiskers the
# range extending 1.5 times the IQR beyond the quantiles
fig = data.boxplot(column='Age')
fig.set_title('')
fig.set_ylabel('Age')

Let's visualize the quantiles; 25%, 50% and 75% in the output below indicate the 25th quantile, median and 75th quantile respectively:

data.Age.describe()

Let's calculate the boundaries outside which the outliers sit, assuming Age follows a Gaussian distribution:

Upper_boundary = data.Age.mean() + 3 * data.Age.std()
Lower_boundary = data.Age.mean() - 3 * data.Age.std()
Upper_boundary, Lower_boundary

The upper boundary for Age is 73–74 years.

The lower boundary is meaningless as there can’t be negative age.

This impossible value arises because the data are not truly normally distributed.
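For context, with the standard Kaggle train.csv the passengers' Age has a mean of roughly 29.7 years and a standard deviation of roughly 14.5 years, so the boundaries come out at about 29.7 + 3 * 14.5 = 73.2 and 29.7 - 3 * 14.5 = -13.8, which explains both the 73–74 year upper boundary and the impossible negative lower boundary.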

Now let's use the interquantile range to calculate the boundaries:

IQR = data.Age.quantile(0.75) - data.Age.quantile(0.25)
Lower_fence = data.Age.quantile(0.25) - (IQR * 1.5)
Upper_fence = data.Age.quantile(0.75) + (IQR * 1.5)
Upper_fence, Lower_fence, IQR

And for extreme outliers:

IQR = data.Age.quantile(0.75) - data.Age.quantile(0.25)
Lower_fence = data.Age.quantile(0.25) - (IQR * 3)
Upper_fence = data.Age.quantile(0.75) + (IQR * 3)
Upper_fence, Lower_fence, IQR

The boundary using 1.5 times the interquantile range roughly coincides with the boundary determined using the Gaussian distribution (about 65 vs 73 years).

The value using 3 times the interquantile range is a bit high relative to normal human life expectancy, particularly in the days of the Titanic.
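For context, with the standard Kaggle train.csv the Age quantiles are roughly 20.1 (25th) and 38.0 (75th), so the IQR is about 17.9; the 1.5 IQR fence is then about 38.0 + 1.5 * 17.9 = 64.8 and the extreme fence about 38.0 + 3 * 17.9 = 91.6, which is where the 65 and 91 year cut-offs used below come from.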

Let's find out whether there are outliers according to the above boundaries:

# let's first remove the passengers with missing data for Age
data = data.dropna(subset=['Age'])

# total number of passengers as a float, to compute proportions
total_passengers = float(data.shape[0])

print('passengers older than 73 (Gaussian approach): {}'.format(data[data.Age > 73].shape[0] / total_passengers))
print('passengers older than 65 (IQR): {}'.format(data[data.Age > 65].shape[0] / total_passengers))
print('passengers older than 91 (IQR, extreme): {}'.format(data[data.Age >= 91].shape[0] / total_passengers))

Roughly 1–2 percent of the passengers were extremely old.

data[data.Age > 65]

We can see that the majority of the outliers did not survive.

We have now identified a bunch of potential outliers.

Let’s see whether these affect the performance of the machine learning algorithms.

Measuring the effect of outliers on different machine learning algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

Let's load the Titanic dataset again:

data = pd.read_csv('titanic.csv')
data.head()

Let's find out if the variables contain missing data:

data[['Age', 'Fare']].isnull().mean()

Age contains 20% missing data. For simplicity, I will fill the missing values with 0.

Let's separate the data into training and testing sets. Remember that, to avoid overfitting and improve generalization, machine learning models need to be built on a train set and evaluated on a test set:

X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare']].fillna(0),
    data.Survived,
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

Let's generate training and testing sets without outliers.

For simplicity, I will replace the outliers by the upper boundary values. This procedure is called capping or top-coding, and I will cover it extensively in the section on handling outliers in a later post.

# let's create a new dataset
data_clean = data.copy()

# replace outliers in Age
# using the boundary from the Gaussian assumption method
data_clean.loc[data_clean.Age >= 73, 'Age'] = 73

# replace outliers in Fare
# using the boundary from the interquantile range method
data_clean.loc[data_clean.Fare > 100, 'Fare'] = 100

# let's divide into train and test sets
X_train_clean, X_test_clean, y_train_clean, y_test_clean = train_test_split(
    data_clean[['Age', 'Fare']].fillna(0),
    data_clean.Survived,
    test_size=0.3,
    random_state=0)

Outlier effect on Logistic Regression

Model built on data with outliers:

# call the model
logit = LogisticRegression(random_state=44)

# train the model
logit.fit(X_train, y_train)

# make predictions on the test set
pred = logit.predict_proba(X_test)

print('LogReg Accuracy: {}'.format(logit.score(X_test, y_test)))
print('LogReg roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Model built on data without outliers:

# call the model
logit = LogisticRegression(random_state=44)

# train the model
logit.fit(X_train_clean, y_train_clean)

# make predictions on the test set
pred = logit.predict_proba(X_test_clean)

print('LogReg Accuracy: {}'.format(logit.score(X_test_clean, y_test_clean)))
print('LogReg roc-auc: {}'.format(roc_auc_score(y_test_clean, pred[:, 1])))

Outliers did not seem to have a big impact on the performance of Logistic Regression.

Outlier Effect on AdaBoost

Model built on data with outliers:

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train, y_train)

# make predictions
pred = ada.predict_proba(X_test)

print('AdaBoost Accuracy: {}'.format(ada.score(X_test, y_test)))
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Model built on data without the outliers:

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train_clean, y_train_clean)

# make predictions
pred = ada.predict_proba(X_test_clean)

print('AdaBoost Accuracy: {}'.format(ada.score(X_test_clean, y_test_clean)))
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test_clean, pred[:, 1])))

On the other hand, we can see how removing the outliers improves the performance of AdaBoost: 0.759 vs 0.746 roc-auc.

Outlier Effect on Random Forests

Model built on data with outliers:

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train, y_train)

# make predictions
pred = rf.predict_proba(X_test)

print('Random Forests Accuracy: {}'.format(rf.score(X_test, y_test)))
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Model built on data without outliers:

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train_clean, y_train_clean)

# make predictions
pred = rf.predict_proba(X_test_clean)

print('Random Forests Accuracy: {}'.format(rf.score(X_test_clean, y_test_clean)))
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test_clean, pred[:, 1])))

As expected, Random Forests do not benefit from removing outliers from the dataset.

Conclusion:

We can see that the presence of outliers affects the performance of AdaBoost, and when the outliers are removed the roc-auc improves by 0.013. Logistic Regression's and Random Forests' performances seemed unaffected by outliers.
