An Awesome Tutorial to Learn Outlier Detection in Python using PyOD Library

PyOD on the Big Mart Sales Problem

Now, let’s see how PyOD does on the famous Big Mart Sales Problem.

Go ahead and download the dataset from the above link.

Let’s start by importing the required libraries and loading the data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from scipy import stats

# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

# reading the big mart sales training data
df = pd.read_csv("train.csv")

Let’s plot Item_MRP vs Item_Outlet_Sales to understand the data:

df.plot.scatter('Item_MRP', 'Item_Outlet_Sales')

Item_Outlet_Sales ranges from 0 to 12000 and Item_MRP ranges from 0 to 250.

We will scale both these features down to the range 0 to 1. This is required to create an explainable visualization (the plot would become far too stretched otherwise), and on this data, creating the visualization on the original scale would also take much more time.

Note: If you don’t need the visualization, you can keep the original scale and still predict whether a point is an outlier or not.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[['Item_MRP', 'Item_Outlet_Sales']] = scaler.fit_transform(df[['Item_MRP', 'Item_Outlet_Sales']])
df[['Item_MRP', 'Item_Outlet_Sales']].head()

Store these values in a NumPy array for use in our models later:

X1 = df['Item_MRP'].values.reshape(-1, 1)
X2 = df['Item_Outlet_Sales'].values.reshape(-1, 1)
X = np.concatenate((X1, X2), axis=1)

Again, we will create a dictionary.

But this time, we will add some more models to it and see how each model predicts outliers. You can set the value of the outlier fraction according to your problem and your understanding of the data. In our example, I want to detect the 5% of observations that are not similar to the rest of the data, so I’m going to set the outlier fraction to 0.05.
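Before wiring up all seven models, it helps to see what this fraction actually does: PyOD passes it to each detector as the contamination parameter, which sets the decision threshold so that roughly that share of the training points is labeled as outliers. Here is a minimal sketch with a single KNN detector (the variable names are my own):

# fit one detector and check the share of flagged points
knn = KNN(contamination=0.05)
knn.fit(X)
print(knn.labels_.mean())  # roughly 0.05: about 5% of points get label 1 (outlier)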

random_state = np.random.RandomState(42)
outliers_fraction = 0.05

# Define seven outlier detection tools to be compared
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outliers_fraction),
    'Cluster-based Local Outlier Factor (CBLOF)': CBLOF(contamination=outliers_fraction, check_estimator=False, random_state=random_state),
    'Feature Bagging': FeatureBagging(LOF(n_neighbors=35), contamination=outliers_fraction, check_estimator=False, random_state=random_state),
    'Histogram-based Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),
    'Isolation Forest': IForest(contamination=outliers_fraction, random_state=random_state),
    'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
    'Average KNN': KNN(method='mean', contamination=outliers_fraction)
}

Now, we will fit the data to each model one by one and see how differently each model predicts the outliers.

xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1
    # prediction of a datapoint category: outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)
    plt.figure(figsize=(10, 10))

    # copy of dataframe
    dfx = df.copy()
    dfx['outlier'] = y_pred.tolist()

    # IX1: inlier feature 1, IX2: inlier feature 2
    IX1 = np.array(dfx['Item_MRP'][dfx['outlier'] == 0]).reshape(-1, 1)
    IX2 = np.array(dfx['Item_Outlet_Sales'][dfx['outlier'] == 0]).reshape(-1, 1)

    # OX1: outlier feature 1, OX2: outlier feature 2
    OX1 = dfx['Item_MRP'][dfx['outlier'] == 1].values.reshape(-1, 1)
    OX2 = dfx['Item_Outlet_Sales'][dfx['outlier'] == 1].values.reshape(-1, 1)

    print('OUTLIERS : ', n_outliers, 'INLIERS : ', n_inliers, clf_name)

    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outliers_fraction)

    # decision_function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)

    # fill blue colormap from minimum anomaly score to threshold value
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7), cmap=plt.cm.Blues_r)

    # draw red contour line where anomaly score is equal to threshold
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')

    # fill orange contour where anomaly score ranges from threshold to maximum
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')

    b = plt.scatter(IX1, IX2, c='white', s=20, edgecolor='k')
    c = plt.scatter(OX1, OX2, c='black', s=20, edgecolor='k')

    plt.axis('tight')
    # loc=2 is used for the top left corner
    plt.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'inliers', 'outliers'],
        prop=matplotlib.font_manager.FontProperties(size=20),
        loc=2)
    plt.xlim((0, 1))
    plt.ylim((0, 1))
    plt.title(clf_name)
    plt.show()

OUTPUT

OUTLIERS : 447 INLIERS : 8076 Angle-based Outlier Detector (ABOD)
OUTLIERS : 427 INLIERS : 8096 Cluster-based Local Outlier Factor (CBLOF)
OUTLIERS : 386 INLIERS : 8137 Feature Bagging
OUTLIERS : 501 INLIERS : 8022 Histogram-based Outlier Detection (HBOS)
OUTLIERS : 427 INLIERS : 8096 Isolation Forest
OUTLIERS : 311 INLIERS : 8212 K Nearest Neighbors (KNN)
OUTLIERS : 176 INLIERS : 8347 Average KNN

In the above plots, the white points are inliers surrounded by red lines, and the black points are outliers in the blue zone.
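The counts above show that each detector draws its decision boundary quite differently even with the same contamination value. If you want to quantify the agreement between models rather than eyeball the plots, a small consensus check like the sketch below can help (this is my own addition, not part of the original walkthrough); it re-fits every detector and counts how many of them flag each point:

votes = np.zeros(len(X), dtype=int)
for clf_name, clf in classifiers.items():
    clf.fit(X)
    votes += clf.predict(X)  # predict() returns 1 for outliers, 0 for inliers

# points flagged by all seven detectors are the strongest outlier candidates
print('flagged by all models:', np.sum(votes == len(classifiers)))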

End Notes

That was an incredible learning experience for me as well. I spent a lot of time researching PyOD and implementing it in Python, and I would encourage you to do the same. Practice using it on different datasets; it’s such a useful library!

PyOD already supports around 20 classical outlier detection algorithms, which can be used in both academic and commercial projects.
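As the loop above showed, every PyOD detector shares the same fit/predict interface, so trying another algorithm is a one-line change. A minimal sketch using LOF directly (the parameter values here are my own choice, not tuned recommendations):

# swap in a different detector through the same interface
lof = LOF(n_neighbors=20, contamination=0.05)
lof.fit(X)
print('LOF outliers:', lof.labels_.sum())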

Its contributors are planning to enhance the toolbox by implementing models that will work well with time series and geospatial data.

If you have any suggestions/feedback related to the article, please post them in the comments section below.

I look forward to hearing your experience using PyOD as well.

Happy learning.

Check out the below awesome courses to learn data science and its various aspects:

Introduction to Data Science (Certified Course)
Python for Data Science
Big Mart Sales Prediction using R
