Wondering how to build an anomaly detection model?

Now we’ll import the dataset. One thing to note here is that the dataset file should be in the same directory as your script.

sio.loadmat() loads our dataset (‘anomalyData.mat’) into the variable dataset.

The variable ‘X’ contains the training dataset, ‘Xval’ the cross-validation set and ‘yval’ the corresponding output for the ‘Xval’.
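For reference, a minimal loading snippet could look like the sketch below, assuming scipy.io is imported as sio as in the original notebook:

import numpy as np
import scipy.io as sio

# Load the .mat file sitting in the same directory as the script
dataset = sio.loadmat('anomalyData.mat')

X = dataset['X']        # unlabeled training examples
Xval = dataset['Xval']  # cross-validation examples
yval = dataset['yval']  # labels for Xval: 1 = anomaly, 0 = normal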

Let’s look at the array ‘X’ that we are going to fit with a Gaussian model to detect anomalous examples.

print(X.shape)

Output: (307, 2)

As you can see, there are 307 training examples, each having 2 features.

The features measure the throughput (mb/s) and latency (ms) of response of each server.

While your servers were operating, you collected m = 307 examples of how they were behaving, and thus have an unlabeled dataset {x(1), ..., x(m)}.

You suspect that the vast majority of these examples are “normal” (non-anomalous) examples of the servers operating normally, but there might also be some examples of servers acting anomalously within this dataset.

Now, let’s visualize the dataset to have a clear picture.

Fig. 1

Gaussian Distribution

To perform anomaly detection, you will first need to fit a model to the data’s distribution.

Given a training set {x(1), …, x(m)} (where x(i) ∈ R^n, here n = 2), you want to estimate the Gaussian distribution for each of the features.

For each feature (i = 1 ... n), you need to find the parameters mean and variance (mu_i, sigma_i²).

To do that, let’s write down the function that calculates the mean and variance of the array (or matrix, if you prefer) X.

The mathematical expressions go something like this (Fig. 2):

mu_i = (1/m) * Σ_{j=1..m} x_i(j)
sigma_i² = (1/m) * Σ_{j=1..m} (x_i(j) − mu_i)²

We calculate the mean for each feature, and with the help of that, we calculate the variance of the corresponding feature.

Let’s put it into code.
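Here is a sketch of what estimateGaussian might look like; the exact version lives in the linked notebook, and this one returns the mean and variance as 1 × n row vectors so that the output matches what is printed below:

def estimateGaussian(X):
    # X has shape (m, n); we fit one Gaussian to every feature (column)
    m = X.shape[0]
    mu = np.mean(X, axis=0, keepdims=True)                      # (1, n) row vector of means
    sigma2 = np.sum((X - mu) ** 2, axis=0, keepdims=True) / m   # (1, n) population variances
    return mu, sigma2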

mu, sigma2 = estimateGaussian(X)
print('mean: ', mu, ' variance: ', sigma2)

Output:
mean: [[14.11222578 14.99771051]] variance: [[1.83263141 1.70974533]]

Now that we have the mean and variance, we need to calculate the probability of the training examples in order to decide which examples are anomalous.

We can do it by using the Multivariate Gaussian model.

Multivariate Gaussian Distribution

The multivariate Gaussian is used to find the probability of each example, and based on some threshold value we decide whether to flag an anomaly or not.

The expression for the multivariate Gaussian model is (Fig. 3):

p(x; mu, Sigma) = 1 / ((2π)^(n/2) |Sigma|^(1/2)) * exp(−(1/2) (x − mu)ᵀ Sigma⁻¹ (x − mu))

Here, mu is the mean of each feature and Sigma is the covariance matrix. These two parameters are used to calculate the probability p(x). Epsilon is the threshold value that we are going to discuss in detail further on.

Once you understand the expression, the code is very simple to implement.

Let’s see how to put it into code.

Inside the function, we first convert the sigma2 vector into a covariance matrix, and then we simply apply the formula for the multivariate distribution to get the probability vector.

If you’ve passed a vector as sigma2, you have to convert it into a matrix with the vector on the diagonal and the rest of the elements as zero.
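A sketch of what multivariateGaussian could look like under that description (the notebook’s version may differ in the details):

def multivariateGaussian(X, mu, sigma2):
    n = mu.size
    # If sigma2 came in as a vector of per-feature variances,
    # turn it into a diagonal covariance matrix
    if sigma2.ndim == 1 or 1 in sigma2.shape:
        sigma2 = np.diag(sigma2.flatten())
    Xc = X - mu.reshape(1, n)                     # centre every example
    det = np.linalg.det(sigma2)
    inv = np.linalg.inv(sigma2)
    coeff = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(det))
    # quadratic form (x - mu)^T Sigma^-1 (x - mu), one value per example
    quad = np.sum((Xc @ inv) * Xc, axis=1, keepdims=True)
    return coeff * np.exp(-0.5 * quad)            # shape (m, 1)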

p = multivariateGaussian(X, mu, sigma2)
print(p.shape)

Output: (307, 1)

Thus, you’ve successfully calculated the probabilities.

Next, you have to calculate the threshold value using some labelled data.

Let’s see how to do this.

pval = multivariateGaussian(Xval, mu, sigma2)

We find the probabilities of ‘Xval’ to compare them with ‘yval’ for determining the threshold.

Let’s find the threshold value.

First, we compute a stepsize so that we can try a wide range of threshold values and pick the best one. We use the F1 score method to determine the best parameters, i.e. bestEpsilon and bestF1. We predict an anomaly if pval < epsilon, which gives a vector of binary values in the variable predict.

F1 score takes into consideration precision and recall.

Inside the function, I’ve implemented a for loop to calculate tp, fp, and fn. I’d love to hear from you if you can come up with a vectorised implementation of this logic.

The formulas are (Fig. 4):

Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
F1 = 2 * Precision * Recall / (Precision + Recall)

The best parameters are the ones for which the F1 score is maximum.

Note: We are going to need a try-except block because there can be cases where we divide by zero to calculate precision and recall.
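One possible sketch of selectThreshHold; here tp, fp, and fn are counted with vectorised comparisons rather than the explicit loop mentioned above, and the choice of 1000 steps is an assumption:

def selectThreshHold(yval, pval):
    bestEpsilon = 0
    bestF1 = 0
    # sweep candidate thresholds between the smallest and largest
    # probability seen on the cross-validation set
    stepsize = (np.max(pval) - np.min(pval)) / 1000
    for epsilon in np.arange(np.min(pval), np.max(pval), stepsize):
        predict = (pval < epsilon)                       # 1 = flagged as an anomaly
        tp = int(np.sum((predict == 1) & (yval == 1)))   # true positives
        fp = int(np.sum((predict == 1) & (yval == 0)))   # false positives
        fn = int(np.sum((predict == 0) & (yval == 1)))   # false negatives
        try:
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            F1 = 2 * precision * recall / (precision + recall)
        except ZeroDivisionError:
            print('Warning dividing by zero!!')
            continue
        if F1 > bestF1:
            bestF1 = F1
            bestEpsilon = epsilon
    return bestF1, bestEpsilon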

F1, epsilon = selectThreshHold(yval, pval)
print('Epsilon and F1 are:', epsilon, F1)

Output:
Warning dividing by zero!!
Epsilon and F1 are: 8.990852779269493e-05 0.8750000000000001

Now that we have the best epsilon value, we are in a position to find the anomalies using the training data’s probabilities.

Anomalies are also called outliers.

outl = (p < epsilon)

This gives us a vector with binary entries, where 1 means anomaly and 0 means normal. We then need to return the indices of the outliers to identify the faulty servers, as shown in the sketch below.
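One way to pull out those indices (a sketch; the notebook may do this with a plain Python loop instead):

# Collect the positions where the outlier flag is set
outliers = np.nonzero(outl.flatten())[0].tolist()
print('Number of outliers:', len(outliers))
print(outliers)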

Output:
Number of outliers: 6
[300, 301, 303, 304, 305, 306]

So, the faulty servers were the ones listed above.

We can also graphically spot the outliers.
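A small matplotlib sketch that reproduces such a plot; the axis labels assume feature 0 is latency and feature 1 is throughput, which the post does not state explicitly:

import matplotlib.pyplot as plt

# Plot every server and circle the flagged ones in red
plt.scatter(X[:, 0], X[:, 1], marker='x', s=20, label='servers')
plt.scatter(X[outliers, 0], X[outliers, 1], facecolors='none',
            edgecolors='r', s=120, label='anomaly')
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
plt.legend()
plt.show()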

The red circles show the faulty servers in the network.

Congratulations! We’ve successfully tested all our functions, and we can now use them on a real dataset to find anomalies.

Let’s finish what we’ve started.

Output:
(1000, 11)
(100, 11)
(100, 1)

The new dataset has 1000 examples, each having 11 features. ‘Xvaltest’ is the cross-validation set for the test samples and ‘yvaltest’ holds the corresponding labels.

Now, do the same thing that you did for the dummy dataset.
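Roughly, the same pipeline applied to the new data could look like this; the variable names newDataset, Xvaltest, yvaltest, ptest and pvaltest are taken from the text, and the loading of the second .mat file is omitted here:

# Fit the Gaussian on the larger dataset and score both sets
mu, sigma2 = estimateGaussian(newDataset)
ptest = multivariateGaussian(newDataset, mu, sigma2)    # probabilities for the 1000 samples
pvaltest = multivariateGaussian(Xvaltest, mu, sigma2)   # probabilities for the CV samples

# Pick the threshold on the labelled cross-validation set
F1test, epsilontest = selectThreshHold(yvaltest, pvaltest)
print('Best epsilon and F1 are', epsilontest, F1test)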

Output:
Warning dividing by zero!!
Best epsilon and F1 are 1.3772288907613575e-18 0.6153846153846154

‘ptest’ contains the predictions for the test samples and ‘pvaltest’ for the cross-validation set.

The best epsilon value comes out to be of the order of 1e-18.

Check for the outliers:

Output:
Outliers are: [9, 20, 21, 30, 39, 56, 62, 63, 69, 70, 77, 79, 86, 103, 130, 147, 154, 166, 175, 176, 198, 209, 212, 218, 222, 227, 229, 233, 244, 262, 266, 271, 276, 284, 285, 288, 289, 290, 297, 303, 307, 308, 320, 324, 338, 341, 342, 344, 350, 351, 353, 365, 369, 371, 378, 398, 407, 420, 421, 424, 429, 438, 452, 455, 456, 462, 478, 497, 518, 527, 530, 539, 541, 551, 574, 583, 587, 602, 613, 614, 628, 648, 674, 678, 682, 685, 700, 702, 705, 713, 721, 741, 750, 757, 758, 787, 831, 834, 836, 839, 846, 870, 885, 887, 890, 901, 911, 930, 939, 940, 943, 951, 952, 970, 975, 992, 996]
Number of outliers are: 117

Thus, there are 117 outliers, and their corresponding indices are given above.

Conclusion

I know that starting from scratch can be messy sometimes, but you get to learn a lot of the details if you do it from scratch.

You can find the notebook for the above project here.

I hope I could teach you something new. Make sure you check out my repository, where I’ve made other projects like DigitRecognition, clustering a bird’s image, etc.

Peace.
