A Gaussian Approach to Detection of Anomalous Behavior in Server Computers

Let's detect the anomaly…

Navoneel Chakrabarty · Mar 24

Anomaly Detection is a variant of Machine Learning problems that falls under Semi-Supervised Learning.

It is semi-supervised because in Anomaly Detection (also popularly known as Outlier Detection), the training procedure does not use the Training Set labels, while model parameters, such as the detection threshold, are fit using the Validation Set labels.

The Test Set labels are then used only for evaluating model performance metrics such as Accuracy, Precision, Recall, F1-Score and AUROC (Area Under the ROC Curve).

One common approach to Anomaly Detection is based on the Gaussian Distribution.

In this approach, each feature is modeled with a Gaussian Distribution, and the probability density of a new data-point is computed with the Gaussian/Normal density function.

If that probability falls below a particular threshold (which is set depending upon the performance of the model on the Validation Set), the new data-point is flagged as an outlier, i.e. anomalous.
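
As a toy illustration of this decision rule (the numbers below are made up, purely for illustration):

epsilon = 1e-4        # threshold tuned on the Validation Set
p_x = 3e-5            # hypothetical density of a new data-point
print(p_x < epsilon)  # True -> the point is flagged as an anomaly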

According to the Gaussian/Normal Distribution with mean µ and standard deviation σ, the probability density p(x) of a value x is:

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Gaussian Distribution Anomaly Detection Algorithm: let there be m data-points (instances), each with n selected features.

The mean parameter for each feature (j = 1 to n) is fit as

\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}

The variance parameter for each feature (j = 1 to n) is fit as

\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2

Given a new data-point x = (x_1, x_2, …, x_n), p(x) is given by

p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)

Now, a threshold parameter ε is selected such that: if p(x) < ε, then x is flagged as an ANOMALY or OUTLIER; otherwise, x is not an anomaly.

Application of the Gaussian Distribution Model for Anomaly Detection on a Server Computer Dataset in Python

Problem Statement: "Detect the anomalous behavior in Server Computers"

The dataset is available on GitHub at navoneel1092283/Server_Computer_Dataset (github.com): a computer server dataset with Throughput (in mb/s) and Latency (in ms) as features for the detection of anomalous behavior.

I. Data Reading (dataset present in .mat format)

import scipy.io

data = scipy.io.loadmat('data.mat')
X = data['Xval']  # features
y = data['yval']  # class labels (0 -> Non-Anomalous, 1 -> Anomalous)
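
A quick shape check can confirm the load (a sketch, not from the original article; the expected counts follow from the 298 non-anomalous and 9 anomalous examples inspected later):

print(X.shape)  # expected (307, 2): 307 instances, 2 features
print(y.shape)  # expected (307, 1), assuming loadmat returns yval as a column vector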

II. Data Visualization (in the form of a Scatter Plot)

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(X.T[0], X.T[1])
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
plt.show()

Scatter Plot for the 2 features, Latency and Throughput

III. Implementation of the Gaussian Distribution Algorithm for Anomaly or Outlier Detection

import numpy as np
from math import sqrt, exp, pi

def gaussian(X, x, epsilon):
    # X represents the Training Set features
    # x represents the set of new data-points (Validation/Test Set)
    mean = np.zeros(X.shape[1])
    std = np.zeros(X.shape[1])
    Xt = X.T
    xt = x.T
    p = np.zeros(x.shape[0])  # vector of output predictions
    # fit the per-feature mean and standard deviation on the Training Set
    for i in range(0, X.shape[1]):
        mean[i] = Xt[i].mean()
        std[i] = Xt[i].std()
    # multiply the per-feature Gaussian densities for each new data-point
    for i in range(0, x.shape[0]):
        prob = 1
        for j in range(0, X.shape[1]):
            prob = prob * (1 / (std[j] * sqrt(2 * pi))) \
                        * exp(-pow(xt[j][i] - mean[j], 2) / (2 * std[j] * std[j]))
        if prob < epsilon:
            p[i] = 1  # flagged as an anomaly
    return p
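
As a cross-check, the same per-feature fit and density product can be written in vectorized form with NumPy and scipy.stats (a minimal sketch, not the article's code; gaussian_vectorized is a hypothetical helper):

import numpy as np
from scipy.stats import norm

def gaussian_vectorized(X, x, epsilon):
    # fit per-feature mean and standard deviation on the Training Set
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # per-feature Gaussian densities of each new data-point, multiplied across features
    p = norm.pdf(x, loc=mu, scale=sigma).prod(axis=1)
    return (p < epsilon).astype(float)  # 1 -> anomaly, 0 -> non-anomalous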

IV. Preparation of the Training Set, Validation Set and Test Set

Here, certain conventional rules have been followed in preparing the Training Set, Validation Set and Test Set:

- The Training Set should contain approximately 60% of the total number of instances in the dataset.
- All instances in the Training Set should be non-anomalous (as per the labels).
- The Validation Set should contain approximately 20% of the total number of instances, with both anomalous and non-anomalous examples (as per the labels).
- The Test Set, containing the remaining instances, should also have both anomalous and non-anomalous examples (as per the labels).

# Inspecting the distribution of class labels in the dataset
unique, counts = np.unique(y, return_counts=True)
print(dict(zip(unique, counts)))

There are 298 non-anomalous and 9 anomalous examples.

itemindex = np.where(y == 1)  # storing the indices of the anomalous examples

# Training Set preparation (non-anomalous instances only)
training_set = np.ones((int(0.6 * X.shape[0]), X.shape[1]))
y_train = np.ones(int(0.6 * X.shape[0]))
count = 0
i = 0
while count < int(0.6 * X.shape[0]):
    if i not in itemindex[0]:
        training_set[count] = X[i]
        y_train[count] = y[i]
        count = count + 1
    i = i + 1

# Validation Set preparation (5 anomalous examples placed at the end)
validation_set = np.ones((int(0.2 * X.shape[0] + 1), X.shape[1]))
y_validation = np.ones(int(0.2 * X.shape[0] + 1))
count = 0
while count <= int(0.2 * X.shape[0]) - 5:
    validation_set[count] = X[i]
    y_validation[count] = y[i]
    count = count + 1
    i = i + 1
for j in range(1, 6):
    validation_set[-j] = X[itemindex[0][j - 1]]
    y_validation[-j] = y[itemindex[0][j - 1]]

# Test Set preparation (the remaining 4 anomalous examples at the end)
test_set = np.ones((int(0.2 * X.shape[0]), X.shape[1]))
y_test = np.ones(int(0.2 * X.shape[0]))
count = 0
while count < int(0.2 * X.shape[0]) - 4:
    if i not in itemindex[0]:
        test_set[count] = X[i]
        y_test[count] = y[i]
        count = count + 1
    i = i + 1
for j in range(6, 10):
    test_set[count] = X[itemindex[0][j - 1]]
    y_test[count] = y[itemindex[0][j - 1]]
    count = count + 1
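
A quick sanity check on the split (a sketch, not from the original article): with 307 total instances the sizes should be int(0.6 * 307) = 184, int(0.2 * 307 + 1) = 62 and int(0.2 * 307) = 61, which together account for all 184 + 62 + 61 = 307 instances.

print(training_set.shape, validation_set.shape, test_set.shape)
# expected: (184, 2) (62, 2) (61, 2)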

V. Training the Model with the tuned threshold parameter, ε = 0.0001 (found to give the best performance on the Validation Set)

predictions_validation = gaussian(training_set, validation_set, 0.0001)
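
The value ε = 0.0001 was found by trying candidates against the Validation Set; one way such a search could be automated is sketched below (an assumption, not the author's code; note that this sketch scores the anomalous class, label 1, as positive, unlike the metric code in the next section):

# Sketch: tune epsilon by F1-score on the Validation Set, reusing gaussian() above
best_epsilon, best_f1 = None, -1.0
for eps in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:  # hypothetical candidate grid
    preds = gaussian(training_set, validation_set, eps)
    tp = ((preds == 1) & (y_validation == 1)).sum()  # anomalies caught
    fp = ((preds == 1) & (y_validation == 0)).sum()  # false alarms
    fn = ((preds == 0) & (y_validation == 1)).sum()  # anomalies missed
    if tp == 0:
        continue
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    if f1 > best_f1:
        best_epsilon, best_f1 = eps, f1
print("Best epsilon by validation F1:", best_epsilon, best_f1)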

VI. Performance Analysis on the Validation Set

# Accuracy calculation
k = 0
for i in range(0, y_validation.shape[0]):
    if predictions_validation[i] == y_validation[i]:
        k = k + 1
accuracy = k / y_validation.shape[0]
print("Validation Accuracy: ", accuracy)

# Precision calculation
# Note: here the non-anomalous class (label 0) is treated as the positive class
tp = fp = 0
# tp -> True Positives, fp -> False Positives
for i in range(0, predictions_validation.shape[0]):
    if predictions_validation[i] == y_validation[i] == 0:
        tp = tp + 1
    elif predictions_validation[i] == 0 and y_validation[i] == 1:
        fp = fp + 1
precision = tp / (tp + fp)
print("Precision on the Validation Set: ", precision)

# Recall calculation
fn = 0
# fn -> False Negatives
for i in range(0, predictions_validation.shape[0]):
    if predictions_validation[i] == 1 and y_validation[i] == 0:
        fn = fn + 1
recall = tp / (tp + fn)
print("Recall on the Validation Set: ", recall)

# F1-Score calculation
f1_score = (2 * precision * recall) / (precision + recall)
print("F1-Score on the Validation Set: ", f1_score)

Performance on the Validation Set
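
For reference, the same four metrics can be computed with scikit-learn, assuming it is installed (a sketch; pos_label=0 matches the article's convention of scoring the non-anomalous class as positive):

import sklearn.metrics as skm  # aliased to avoid clashing with the f1_score variable above

print("Accuracy:", skm.accuracy_score(y_validation, predictions_validation))
print("Precision:", skm.precision_score(y_validation, predictions_validation, pos_label=0))
print("Recall:", skm.recall_score(y_validation, predictions_validation, pos_label=0))
print("F1-Score:", skm.f1_score(y_validation, predictions_validation, pos_label=0))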

VII. Scatter Plot Performance Visualization on the Validation Set

# Scatter plot with data-points having the actual labels
itemindex = np.where(y_validation == 1)
validation_non_anomalous = np.zeros((y_validation.shape[0] - itemindex[0].shape[0], validation_set.shape[1]))
count = 0
for i in range(0, validation_set.shape[0]):
    if i not in itemindex[0]:
        validation_non_anomalous[count] = validation_set[i]
        count = count + 1
validation_anomalous = np.zeros((itemindex[0].shape[0], validation_set.shape[1]))
count = 0
for i in itemindex[0]:
    validation_anomalous[count] = validation_set[i]
    count = count + 1
plt.scatter(validation_non_anomalous.T[0], validation_non_anomalous.T[1], c="green", label="Non-Anomalous")
plt.scatter(validation_anomalous.T[0], validation_anomalous.T[1], c="red", label="Anomalous")
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
plt.legend()
plt.show()

# Scatter plot with data-points having the labels predicted by the model
itemindex = np.where(predictions_validation == 1)
validation_predicted_non_anomalous = np.zeros((y_validation.shape[0] - itemindex[0].shape[0], validation_set.shape[1]))
count = 0
for i in range(0, validation_set.shape[0]):
    if i not in itemindex[0]:
        validation_predicted_non_anomalous[count] = validation_set[i]
        count = count + 1
validation_predicted_anomalous = np.zeros((itemindex[0].shape[0], validation_set.shape[1]))
count = 0
for i in itemindex[0]:
    validation_predicted_anomalous[count] = validation_set[i]
    count = count + 1
plt.scatter(validation_predicted_non_anomalous.T[0], validation_predicted_non_anomalous.T[1], c="green", label="Non-Anomalous")
plt.scatter(validation_predicted_anomalous.T[0], validation_predicted_anomalous.T[1], c="red", label="Anomalous")
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
plt.legend()
plt.show()

Scatter Plot with Actual Labels vs. Scatter Plot with Predicted Labels for the Validation Set
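
Since nearly the same plotting code is repeated for actual and predicted labels on both sets, it could be factored into a small helper (a sketch; plot_labels is a hypothetical function, not part of the original article):

def plot_labels(features, labels):
    # hypothetical helper: green for non-anomalous (label 0), red for anomalous (label 1)
    mask = (labels == 1)
    plt.scatter(features[~mask].T[0], features[~mask].T[1], c="green", label="Non-Anomalous")
    plt.scatter(features[mask].T[0], features[mask].T[1], c="red", label="Anomalous")
    plt.xlabel('Latency (ms)')
    plt.ylabel('Throughput (mb/s)')
    plt.legend()
    plt.show()

# usage, reproducing the two validation plots above:
# plot_labels(validation_set, y_validation)
# plot_labels(validation_set, predictions_validation)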

VIII. Performance Analysis on the Test Set

predictions_test = gaussian(training_set, test_set, 0.0001)

# Accuracy calculation
k = 0
for i in range(0, y_test.shape[0]):
    if predictions_test[i] == y_test[i]:
        k = k + 1
accuracy = k / y_test.shape[0]
print("Test Accuracy: ", accuracy)

# Precision calculation
# Note: as before, the non-anomalous class (label 0) is treated as the positive class
tp = fp = 0
# tp -> True Positives, fp -> False Positives
for i in range(0, predictions_test.shape[0]):
    if predictions_test[i] == y_test[i] == 0:
        tp = tp + 1
    elif predictions_test[i] == 0 and y_test[i] == 1:
        fp = fp + 1
precision = tp / (tp + fp)
print("Precision on the Test Set: ", precision)

# Recall calculation
fn = 0
# fn -> False Negatives
for i in range(0, predictions_test.shape[0]):
    if predictions_test[i] == 1 and y_test[i] == 0:
        fn = fn + 1
recall = tp / (tp + fn)
print("Recall on the Test Set: ", recall)

# F1-Score calculation
f1_score = (2 * precision * recall) / (precision + recall)
print("F1-Score on the Test Set: ", f1_score)

Performance on the Test Set

IX. Scatter Plot Performance Visualization on the Test Set

# Scatter plot with data-points having the actual labels
itemindex = np.where(y_test == 1)
test_non_anomalous = np.zeros((y_test.shape[0] - itemindex[0].shape[0], test_set.shape[1]))
count = 0
for i in range(0, test_set.shape[0]):
    if i not in itemindex[0]:
        test_non_anomalous[count] = test_set[i]
        count = count + 1
test_anomalous = np.zeros((itemindex[0].shape[0], test_set.shape[1]))
count = 0
for i in itemindex[0]:
    test_anomalous[count] = test_set[i]
    count = count + 1
plt.scatter(test_non_anomalous.T[0], test_non_anomalous.T[1], c="green", label="Non-Anomalous")
plt.scatter(test_anomalous.T[0], test_anomalous.T[1], c="red", label="Anomalous")
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
plt.legend()
plt.show()

# Scatter plot with data-points having the labels predicted by the model
itemindex = np.where(predictions_test == 1)
test_predicted_non_anomalous = np.zeros((y_test.shape[0] - itemindex[0].shape[0], test_set.shape[1]))
count = 0
for i in range(0, test_set.shape[0]):
    if i not in itemindex[0]:
        test_predicted_non_anomalous[count] = test_set[i]
        count = count + 1
test_predicted_anomalous = np.zeros((itemindex[0].shape[0], test_set.shape[1]))
count = 0
for i in itemindex[0]:
    test_predicted_anomalous[count] = test_set[i]
    count = count + 1
plt.scatter(test_predicted_non_anomalous.T[0], test_predicted_non_anomalous.T[1], c="green", label="Non-Anomalous")
plt.scatter(test_predicted_anomalous.T[0], test_predicted_anomalous.T[1], c="red", label="Anomalous")
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
plt.legend()
plt.show()

Scatter Plot with Actual Labels vs. Scatter Plot with Predicted Labels for the Test Set (exactly the same)

So, the Gaussian Distribution Algorithm correctly identifies all the outliers or anomalies in the Test Set without falsely predicting any non-anomalous instance as anomalous.

There are many other, more advanced Anomaly Detection models, such as Bayesian Networks, Hidden Markov Models (HMMs) and cluster-analysis-based outlier detection. I'll go through those approaches in my upcoming articles.

For personal contact regarding the article, or discussions on Machine Learning/Data Mining or any area of Data Science, feel free to reach out to me on LinkedIn: Navoneel Chakrabarty, Contributing Author at Towards Data Science (www.linkedin.com).
