Forecasting: how to detect outliers?

is, of course, left to you to experiment…

Do It Yourself

Python

If you have a pandas DataFrame with one column for the forecast and another for the demand (the typical output of our exponential smoothing models), you can use this code:

    from scipy.stats import norm

    # Forecast error, its mean and its standard deviation
    df["Error"] = df["Forecast"] - df["Demand"]
    m = df["Error"].mean()
    s = df["Error"].std()

    # 1st and 99th percentiles of a normal error distribution, applied around the forecast
    limit_high = norm.ppf(0.99, m, s) + df["Forecast"]
    limit_low = norm.ppf(0.01, m, s) + df["Forecast"]

    # Shrink any demand observation outside the limits back to the nearest limit
    df["Updated"] = df["Demand"].clip(lower=limit_low, upper=limit_high)
    print(df)

Go the extra mile!

Think back to our idea of analyzing the forecast error and setting a threshold of acceptable errors: we still have a minor issue. The threshold we compute is based on a dataset that includes the outliers. These outliers drive the error variation upward, so the acceptable threshold is biased and overestimated. To correct this, one could shrink the outlier not to the threshold calculated from the original demand dataset, but to a limit calculated from a dataset without this specific outlier.
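To make this bias concrete, here is a minimal sketch on made-up numbers. The error values below are illustrative assumptions, not the dataset used in this article, and `band` is a hypothetical helper. It fits a normal distribution to the errors twice, once with a single large error included and once without it, and compares the resulting 1% to 99% bands.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical forecast errors (forecast - demand); the last value is an outlier
errors = np.array([1.0, -2.0, 0.5, -1.5, 2.0, -0.5, 1.5, -1.0, 0.0, -15.0])

def band(err):
    """Return the 1st and 99th percentiles of a normal distribution fitted to err."""
    m = err.mean()
    s = err.std(ddof=1)  # ddof=1 matches the default of pandas' .std()
    return norm.ppf(0.01, m, s), norm.ppf(0.99, m, s)

low_all, high_all = band(errors)            # fitted with the outlier
low_clean, high_clean = band(errors[:-1])   # fitted without the outlier

print(f"with outlier:    [{low_all:.1f}, {high_all:.1f}]")
print(f"without outlier: [{low_clean:.1f}, {high_clean:.1f}]")
```

The band fitted with the outlier is much wider, and it is also shifted because the outlier pulls the error mean away from zero.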

Here's the recipe:

1. Populate a first forecast against the historical demand.
2. Compute the error, the error mean and the error standard deviation.
3. Compute the lower & upper acceptable thresholds (based on the error mean and standard deviation).
4. Identify outliers just as explained previously.
5. Re-compute the error mean and standard deviation, this time excluding the outliers.
6. Update the lower & upper acceptable thresholds based on these new values.
7. Update the outlier values based on the new thresholds.
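The recipe above can be sketched as a single reusable function. This is only an illustration under stated assumptions: the function name `shrink_outliers`, the `alpha` parameter and the demo data are hypothetical, and the forecast (step 1) is assumed to be already present in a `Forecast` column.

```python
import pandas as pd
from scipy.stats import norm

def shrink_outliers(df, alpha=0.01):
    """Two-pass outlier shrinking; expects 'Forecast' and 'Demand' columns."""
    out = df.copy()
    # Step 2: error, error mean and error standard deviation
    out["Error"] = out["Forecast"] - out["Demand"]
    m, s = out["Error"].mean(), out["Error"].std()
    # Steps 3-4: flag errors outside the [alpha, 1 - alpha] probability band
    prob = norm.cdf(out["Error"], m, s)
    outliers = (prob < alpha) | (prob > 1 - alpha)
    # Step 5: re-compute the mean and standard deviation without the flagged points
    m2 = out["Error"][~outliers].mean()
    s2 = out["Error"][~outliers].std()
    # Step 6: updated thresholds around the forecast
    limit_low = norm.ppf(alpha, m2, s2) + out["Forecast"]
    limit_high = norm.ppf(1 - alpha, m2, s2) + out["Forecast"]
    # Step 7: shrink the demand to the updated thresholds
    out["Updated"] = out["Demand"].clip(lower=limit_low, upper=limit_high)
    return out

# Made-up data: a flat forecast of 10 and a single demand spike
demo = pd.DataFrame({
    "Forecast": [10.0] * 10,
    "Demand": [9, 11, 10, 12, 8, 10, 11, 9, 10, 30],
})
result = shrink_outliers(demo)
print(result[["Demand", "Updated"]])
```

The sketch keeps this article's convention of adding the error quantiles to the forecast. Since the error is defined as forecast minus demand, the two possible sign conventions give limits that differ by twice the error mean, so the choice matters little when that mean is close to zero.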

If we go back to our seasonal example from above, we initially had a forecast error mean of 0.4 and a standard deviation of 3.22. If we remove the point Y2 M11, we obtain an error mean of -0.1 and a standard deviation of 2.3. That means the thresholds are now -5.3 and 5.2 around the forecast. Our outlier in Y2 M11 would then be updated to 10 (instead of 12 with our previous technique).

Do It Yourself

We'll reuse the code from our previous idea and add a new step to update the error mean and standard deviation values.

    from scipy.stats import norm

    df["Error"] = df["Forecast"] - df["Demand"]
    m = df["Error"].mean()
    s = df["Error"].std()

    # Flag errors that fall outside the 1% to 99% band of the fitted normal distribution
    prob = norm.cdf(df["Error"], m, s)
    outliers = (prob > 0.99) | (prob < 0.01)

    # Re-compute the mean and standard deviation without the flagged outliers
    m2 = df["Error"][~outliers].mean()
    s2 = df["Error"][~outliers].std()

    # Updated limits around the forecast, then shrink the demand to them
    limit_high = norm.ppf(0.99, m2, s2) + df["Forecast"]
    limit_low = norm.ppf(0.01, m2, s2) + df["Forecast"]
    df["Updated"] = df["Demand"].clip(lower=limit_low, upper=limit_high)
    print(df)

About the author

Nicolas Vandeput is a supply chain data scientist specialized in demand forecasting & inventory optimization.

In 2016, he founded SupChains (www.supchains.com), his consultancy company; two years later, he co-founded SKU Science (www.skuscience.com), a smart online platform for supply chain management. If you are interested in forecasting and machine learning, you can buy his book Data Science for Supply Chain Forecast.
