Anomaly Detection with LSTM in Keras

Predict Anomalies using Confidence Intervals

Marco Cerliani, Jun 15

Photo by Scott Umstattd on Unsplash

I read definitions of ‘anomaly’ in every kind of context, everywhere.

In this chaos the only truth is the variability of the definition, i.e. what counts as an anomaly depends entirely on the domain of interest.

Detecting this kind of behavior is useful in every business, and how hard it is to detect these observations depends on the field of application.

If you are working on an anomaly detection problem that involves human activity (like predicting sales or demand), you can take advantage of basic assumptions about human behavior and plan a more efficient solution.

This is exactly what we are doing in this post.

We try to predict taxi demand in NYC during a critical time period.

We formulate simple but important assumptions about human behavior, which lead us to a simple way of forecasting anomalies.

All the dirty work is done by a trusty LSTM, developed in Keras, which makes predictions and detects anomalies at the same time!

THE DATASET

I took the dataset for this analysis from the Numenta community.

In particular I chose the NYC Taxi Dataset.

This dataset shows the NYC taxi demand from 2014-07-01 to 2015-01-31 with an observation every half hour.

Figure: Example of weekly NORMAL observations

In this period 5 anomalies are present, in terms of deviation from normal behavior.

They occur respectively during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm.

Figure: Example of weekly ABNORMAL observations: NYC marathon and Christmas

Our purpose is to detect these abnormal observations in advance!

The first thing we notice, looking at the data, is an obvious daily pattern (demand is higher during the day than at night).

Taxi demand also seems to follow a weekly pattern: on certain days of the week demand is higher than on others.

We can verify this by computing the autocorrelation.

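    import numpy as np
    import matplotlib.pyplot as plt

    # Lags up to 10 weeks (48 half-hour observations per day).
    # 'value' is assumed to be the name of the demand column in df.
    timeLags = np.arange(1, 10*48*7)
    autoCorr = [df['value'].autocorr(lag=dt) for dt in timeLags]
    plt.plot(1.0/(48*7)*timeLags, autoCorr)  # x axis expressed in weeks
    plt.xlabel('time lag [weeks]')
    plt.ylabel('correlation coeff', fontsize=12)

Figure: Autocorrelation over a depth of 10 weeks

For now we take note of these important behaviours for the analysis that follows.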

I compute and store the mean demand for every day of the week at every hour.

This will be useful when we standardize the data to build our model, in order to reduce temporal dependency (I compute the means on the first 5000 observations only, which will become our training set).
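The weekday and hour means described above could be computed along these lines. This is a minimal sketch on synthetic data, not the code from the post: the column names `timestamp` and `value`, the helper name `weekday_hour_means`, and the synthetic series are all assumptions made for illustration. Only the cut at the first 5000 observations comes from the text.

```python
import numpy as np
import pandas as pd

def weekday_hour_means(df, n_train=5000):
    """Mean demand per (weekday, hour), estimated on the first n_train rows only."""
    train = df.iloc[:n_train]
    weekday = train["timestamp"].dt.weekday.rename("weekday")
    hour = train["timestamp"].dt.hour.rename("hour")
    return train.groupby([weekday, hour])["value"].mean()

# Synthetic half-hourly series standing in for the taxi data (assumption).
rng = np.random.default_rng(0)
ts = pd.date_range("2014-07-01", periods=10320, freq="30min")
hours = ts.hour.to_numpy()
demand = 15000 + 8000 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 500, len(ts))
df = pd.DataFrame({"timestamp": ts, "value": demand})

means = weekday_hour_means(df)
print(means.shape)  # one entry per weekday/hour pair
```

Restricting the estimate to the training rows keeps information from the test period out of the standardization step.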

THE MODEL

We need a strategy to detect outliers in advance.

To do this, we focus on predicting taxi demand.

We want to develop a model that can forecast demand while taking uncertainty into account.

One way to do this is quantile regression.

We focus on predicting extreme values: a lower bound (10th quantile), an upper bound (90th quantile) and the classical 50th quantile.

By also computing the 90th and 10th quantiles we cover the most likely values reality can assume.

The width of this range can vary a lot; it is small when our model is sure about the future and it can be huge when our model isn't able to see important changes in the domain of interest.

We take advantage of this behaviour and let our model say something about outlier detection in the field of taxi demand prediction.

We expect a narrow interval (90-10 quantile range) when our model is sure about the future because it has everything under control; on the other hand, we expect an anomaly when the interval becomes wider.

This is possible because our model isn't trained to handle this kind of scenario, which can result in an anomaly.

We turn all this magic into reality by building a simple LSTM neural network in Keras.

Our model receives the past observations as input.

We reshape our data to feed the LSTM with a daily window (48 observations: one for every half hour).

When generating the data, as mentioned above, we apply a logarithmic transformation and standardize by subtracting the mean value for the corresponding weekday and hour, so that each observation is the logarithmic variation from its mean for that weekday and hour.

We build our target variable in the same way, shifted by half an hour (we want to predict the demand for the next thirty minutes).
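The windowing step described above can be sketched as follows. This is an illustrative assumption of how the data might be arranged, not the code from the post: `make_windows`, the toy `demand` values and the constant `reference_mean` are all hypothetical, and the standardization is written as a difference of logarithms, which is one plausible reading of the text.

```python
import numpy as np

def make_windows(series, window=48):
    """Slice a 1-D series into overlapping windows and next-step targets."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window:]            # the value right after each window
    return X[..., np.newaxis], y   # add a feature axis: (samples, window, 1)

# Toy standardized series: log demand minus log of a reference mean (invented values).
demand = np.array([100.0, 120.0, 90.0, 110.0, 130.0, 95.0])
reference_mean = np.full_like(demand, 100.0)
z = np.log(demand) - np.log(reference_mean)

X, y = make_windows(z, window=3)
print(X.shape, y.shape)
```

With `window=48`, each sample holds one day of half-hourly observations and its target is the observation thirty minutes after the window ends.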

    from keras.layers import Input, LSTM, Bidirectional, Dense
    from keras.models import Model

    inputs = Input(shape=(X_train.shape[1], X_train.shape[2]))
    # training=True keeps dropout active at prediction time as well
    lstm = Bidirectional(LSTM(64, return_sequences=True, dropout=0.3))(inputs, training=True)
    lstm = Bidirectional(LSTM(16, return_sequences=False, dropout=0.3))(lstm, training=True)
    dense = Dense(50)(lstm)
    # one output head per quantile
    out10 = Dense(1)(dense)
    out50 = Dense(1)(dense)
    out90 = Dense(1)(dense)
    model = Model(inputs, [out10, out50, out90])

Implementing quantile regression in Keras is very simple (I took inspiration from this post).

We easily define the custom quantile loss function which penalizes errors based on the quantile and whether the error was positive (actual > predicted) or negative (actual < predicted).

Our network has 3 outputs and 3 losses, one for every quantile we try to predict.
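The asymmetric penalty described above can be checked numerically. The following is a small NumPy sketch of the quantile (pinball) loss, written for illustration only; the function name `pinball_loss` and the toy values are assumptions and it is not the Keras loss used to train the network.

```python
import numpy as np

def pinball_loss(q, y_true, y_pred):
    """Quantile (pinball) loss averaged over samples."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # positive errors (actual > predicted) cost q * e,
    # negative errors (actual < predicted) cost (1 - q) * |e|
    return float(np.mean(np.maximum(q * e, (q - 1) * e)))

y_true = np.array([10.0, 10.0])
under = pinball_loss(0.9, y_true, y_true - 2.0)  # prediction too low by 2
over = pinball_loss(0.9, y_true, y_true + 2.0)   # prediction too high by 2
print(under, over)
```

For the 90th quantile, predicting too low costs roughly nine times more than predicting too high by the same amount, which pushes that output toward the upper end of the distribution. For the 50th quantile the two penalties are equal.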

    from keras import backend as K

    def q_loss(q, y, f):
        # pinball loss: positive errors weighted by q, negative errors by (1 - q)
        e = (y - f)
        return K.mean(K.maximum(q*e, (q-1)*e), axis=-1)

    losses = [lambda y, f: q_loss(0.1, y, f),
              lambda y, f: q_loss(0.5, y, f),
              lambda y, f: q_loss(0.9, y, f)]

    model.compile(loss=losses, optimizer='adam', loss_weights=[0.3, 0.3, 0.3])

CROSSOVER PROBLEM

When dealing with neural networks in Keras, one tedious problem is the variability of results due to the random initialization of the internal weights.

By its formulation, our problem seems to suffer particularly from this; i.e. when computing quantile predictions we can't allow the quantiles to cross, as this makes no sense!

To avoid this pitfall I use bootstrapping in the prediction phase: I reactivate the dropout of my network (training=True in the model), repeat the prediction 100 times, store the results and finally calculate the desired quantiles (I also use this clever technique in this post).

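    import tqdm

    pred_10, pred_50, pred_90 = [], [], []
    # Backend function returning the three quantile heads; the layer indices are
    # assumed here (first layer as input, last three layers as outputs).
    NN = K.function([model.layers[0].input, K.learning_phase()],
                    [model.layers[-3].output, model.layers[-2].output, model.layers[-1].output])

    # 100 stochastic forward passes with dropout active
    for i in tqdm.tqdm(range(0, 100)):
        predd = NN([X_test, 0.5])
        pred_10.append(predd[0])
        pred_50.append(predd[1])
        pred_90.append(predd[2])

This process is shown graphically below, focusing on a subset of predictions.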

Given quantile bootstraps, we calculated summary measures (red lines) of them, avoiding crossover.
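One plausible way to turn the stored bootstraps into summary measures is to take, for each output head, the matching quantile across the repeated predictions. The sketch below uses simulated values in place of real network outputs, so the arrays `boot_10`, `boot_50`, `boot_90` and their noise levels are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_boot, n_points = 100, 5

# Simulated stochastic forward passes for each output head (invented values).
center = np.linspace(-0.2, 0.2, n_points)
boot_10 = center - 0.1 + rng.normal(0, 0.08, (n_boot, n_points))
boot_50 = center + rng.normal(0, 0.08, (n_boot, n_points))
boot_90 = center + 0.1 + rng.normal(0, 0.08, (n_boot, n_points))

# Summarize each head across the bootstrap axis with its own quantile.
q10 = np.quantile(boot_10, 0.10, axis=0)
q50 = np.quantile(boot_50, 0.50, axis=0)
q90 = np.quantile(boot_90, 0.90, axis=0)

print(np.all(q10 < q50), np.all(q50 < q90))
```

Taking the lower tail of the low head and the upper tail of the high head pushes the summaries apart, which makes crossing unlikely in practice, although it is not a strict guarantee.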

Figure: q90 prediction bootstraps (cyan); q50 prediction bootstraps (blue); q10 prediction bootstraps (green)

RESULTS

As mentioned previously, I used the first 5000 observations for training and the remaining ones (around 5000) for testing.

Our model achieves great performance forecasting taxi demand with the 50th quantile.

A Mean Squared Log Error of around 0.055 is a brilliant result!

This means that the LSTM network is able to understand the underlying rules that drive taxi demand.
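For reference, the Mean Squared Log Error is usually defined as the mean of the squared differences between log(1 + actual) and log(1 + predicted). The sketch below follows that common definition; the function name `msle` and the sample values are illustrative assumptions, and the post does not show how its own figure was computed.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared error between log(1 + y_true) and log(1 + y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Invented demand values: a 10% miss on the first point, a 5% miss on the second.
score = msle([10000.0, 20000.0], [11000.0, 19000.0])
print(score)
```

Because the error is measured on a logarithmic scale, it reflects relative rather than absolute deviations, which suits a series whose level changes a lot between day and night.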

So our approach to anomaly detection sounds promising... We compute the difference between the 90th quantile predictions and the 10th quantile predictions and see what happens.

Figure: Real values (red line); quantile interval length (blue dots)

The quantile interval range (blue dots) is wider in periods of uncertainty.

In the other cases, the model tends to generalize well, as we expected.
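The interval width can also be turned into an explicit flag. The post identifies the uncertain periods visually, so the thresholding rule below is an assumption added for illustration: `flag_wide_intervals`, the 0.95 cut-off and the toy arrays are all hypothetical.

```python
import numpy as np

def flag_wide_intervals(q10, q90, threshold_quantile=0.95):
    """Return the interval width and a mask of unusually wide intervals."""
    width = np.asarray(q90, dtype=float) - np.asarray(q10, dtype=float)
    threshold = np.quantile(width, threshold_quantile)
    return width, width > threshold

# Toy predictions: a narrow interval everywhere except two time steps.
q10 = np.zeros(100)
q90 = np.full(100, 0.2)
q90[[40, 41]] = 1.0

width, flags = flag_wide_intervals(q10, q90)
print(np.flatnonzero(flags))
```

Any such rule needs a threshold chosen for the problem at hand; a quantile of the observed widths is only one simple option.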

Going deeper, we investigate these periods of high uncertainty.

We notice that they coincide with our initial assumptions.

The orange circles plotted below are respectively: NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snow storm.

Figure: Anomaly Detection

We can conclude that we reached our initial targets: achieve great forecasting power and exploit the strength of our model to identify uncertainty.

We also use this to say something about anomaly detection.

SUMMARY

In this post I reproduce a good solution for anomaly detection and forecasting.

We use an LSTM network to learn the behaviour of taxi demand in NYC.

We use what it learned to make predictions and estimate uncertainty at the same time.

We implicitly define an anomaly as an unpredictable observation, i.e. one with a great amount of uncertainty.

This simple assumption lets our LSTM do all the work for us.

CHECK MY GITHUB REPO

Keep in touch: Linkedin
