Machine Learning in Cyber Security — Malicious Software Installation

Introduction

Monitoring the activities of local administrators is a constant challenge for SOC analysts and security professionals.

Most security frameworks recommend implementing a whitelist mechanism.

However, the real world is often not ideal.

You will always have developers or users with local administrator rights who can bypass the specified controls.

Is there a way to monitor local administrator activities?

Let's talk about the data source

[Figure: an example of what the dataset looks like — the three entries shown refer to the same software]

We run a regular batch job to retrieve the software installed on each workstation, and the workstations are located in different regions.

Most of the installed software is displayed in its local language. (Yes, you name it — it could be Japanese, French, Dutch….)

So you will run into situations where the same piece of whitelisted software shows up under seven different names.

Not to mention, we have thousands of devices.

Attributes of the dataset

- Hostname — the hostname of the device
- Publisher Name — the software publisher
- Software Name — the software name in the local language, with varying version numbers

Is there a way we could identify non-standard installations?

My idea is that legitimate software used in the company should have more than one installation, and its name can differ from device to device.

In such a case, I believe machine learning can effectively help classify the software and highlight any outliers.

Char processing using Term Frequency — Inverse Document Frequency (TF-IDF)

Natural Language Processing (NLP) is a sub-field of artificial intelligence that deals with understanding and processing human language.

In light of new advancements in machine learning, many organizations have begun applying natural language processing for translation, chatbots, and candidate filtering.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
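As a rough illustration of those two metrics, here is a toy sketch with a made-up three-document corpus (this uses the plain textbook formula, not the exact smoothing and normalization that scikit-learn applies):

```python
import math

# Toy corpus: three short "documents"
docs = ["apple pie", "apple tart", "banana split"]

def tf(term, doc):
    # term frequency: share of the document's words that are `term`
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # inverse document frequency: rarer terms score higher
    n_containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / n_containing)

# "apple" appears in two of the three documents, "banana" in only one,
# so "banana" ends up with the higher TF-IDF score.
score_apple = tf("apple", docs[0]) * idf("apple", docs)
score_banana = tf("banana", docs[2]) * idf("banana", docs)
```

The product rewards terms that are frequent in one document but rare across the collection, which is exactly the behaviour we want when scoring characters in software names.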

TF-IDF is usually used for word extraction.

However, I was wondering whether it could also be applied to character extraction.

The idea is to explore how well we can apply TF-IDF to extract character-level features from the software name, i.e. the importance of each character to each name.

The script below walks through how I apply TF-IDF to the software name field in my dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Import the dataset
df = pd.read_csv("your dataset")

# Extract the software name column
field_extracted = df['softwarename']

# Initialize the TF-IDF vectorizer at the character level
vectorizer = TfidfVectorizer(analyzer='char')
vectors = vectorizer.fit_transform(field_extracted)
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
result = pd.DataFrame(denselist, columns=feature_names)
```

A snippet of the result:

[Figure: the result of the TF-IDF script above, with a mix of different languages, e.g. Korean and Chinese]

In the diagram above, you can see that a calculation is performed to evaluate how "important" each char is to the software name.

This can also be interpreted as how "many" of the specified characters appear in each software name.

In this way, you can represent the characteristics of each "software name" statistically, and feed these features into the machine learning model of your choice.

Other features I extracted that I believe will also be meaningful to the model:

The entropy of the software name

```python
import math
from collections import Counter

# Function for calculating entropy
def eta(data, unit='natural'):
    base = {
        'shannon': 2.,
        'natural': math.exp(1),
        'hartley': 10.,
    }
    if len(data) <= 1:
        return 0
    counts = Counter()
    for d in data:
        counts[d] += 1
    ent = 0
    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])
    return ent

entropy = [eta(x) for x in field_extracted]
```

Space Ratio — how many spaces the software name has

Vowel Ratio — how many vowels (aeiou) the software name has

Finally, I ran these features, together with labels, against a random forest classifier.
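The space and vowel ratios can be computed with a couple of small helpers (a minimal sketch; the function names are my own, and I normalize the counts by the name's length so the ratios are comparable across names):

```python
def space_ratio(name):
    # fraction of characters in the software name that are spaces
    return name.count(' ') / len(name) if name else 0.0

def vowel_ratio(name):
    # fraction of characters that are vowels (aeiou), case-insensitive
    return sum(ch in 'aeiou' for ch in name.lower()) / len(name) if name else 0.0

# Applied to every extracted software name, e.g.:
# space_ratios = [space_ratio(x) for x in field_extracted]
```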

You can use any classifier of your choice, as long as it gives you satisfactory results.
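A minimal sketch of that last step, using a synthetic stand-in feature matrix (in practice you would use the TF-IDF columns plus the entropy, space-ratio, and vowel-ratio features, with labels derived from your own whitelist):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real feature matrix and whitelist labels
rng = np.random.default_rng(0)
features = rng.random((200, 10))
labels = (features[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Held-out accuracy; anything the model flags as non-standard is a
# candidate outlier for the analyst to review
accuracy = clf.score(X_test, y_test)
```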

Thanks for reading!

About the Author

Elaine Hung

Elaine is a machine learning enthusiast and a digital forensics and incident response consultant. She is interested in applying ML and NLP to cyber security topics.
