Building a Spam Filter from Scratch Using Machine Learning

Sometimes the stemmer actually strips off additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.

Removal of non-words: Non-words and punctuation have been removed. All white space (tabs, newlines, spaces) has been trimmed to a single space character.

Generate Dictionary (Vocabulary)

Once the data is prepared, we can create the dictionary: the set of features (words, in this case) that the algorithm will later use to decide whether a given email message is spam or nonspam.

The code for this step takes all the files (emails) under the data folder and counts the number of occurrences of each word.

mail 1364…

Now that we have created our dictionary, we are ready to go to the next step.

Generating Features

In this step we extract the features from the train and test emails, so that the resulting structure is ready as input to the Naive Bayes algorithm in the next step, where we generate the prediction model. The dictionary that we already created contains all 2500 words (features) on which we will build the prediction model.

The data structure will look like the following:

1 7 1
1 12 2
1 19 2
1 22 1
1 25 1

Here each row corresponds to:
first column: document sequence number
second column: sequence number of the word in the dictionary
third column: number of occurrences of that dictionary word in the given email

For the first row, this means the document number is 1, which is 3–380msg4.txt from the nonspam-train folder under the data folder.

In the end we should get 4 .txt files.
train-features.txt: contains the data structured as above for the emails from the nonspam-train and spam-train folders
train-labels.txt: contains one column, with the number of rows equal to the number of processed emails

The processing includes 50 emails from spam-train and nonspam-train and 130 (all) from spam-test and nonspam-test. First, we read the nonspam-train folder: for each email we count each word from the dictionary, and if a particular word occurs we write it down as a row (docId, wordId, count) in the featureTrain matrix.

It counts the number of occurrences of w (in our case, one word from the dictionary) across all of c (the sum of all occurrences of the dictionary words in either spam or nonspam emails, depending on which class we are estimating the probability for).

The variables are self-describing, so please comment if you have any questions.
prob_token_spam: the probability of occurrence of each word in the spam emails
prob_token_nonspam: the probability of occurrence of each word in the nonspam emails

Now that we have calculated all the parameters of our model, we can go to the final step.

Test the ML Model

In this final step we test our model on the test data set that we already have.

We calculate prob_spam, the probability that a given email is spam, computed over all the training emails (spam + nonspam). Then we multiply the word occurrences for each email in test_matrix with our generated model, prob_token_spam and prob_token_nonspam for spam and nonspam emails respectively. Basically, our model represents the weight (impact) that each word has when deciding whether a given email is spam or nonspam. The output variable contains a true/false value for each email.
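To make the whole pipeline concrete, here is a minimal Python sketch of the steps described above: building a dictionary, encoding emails as (docId, wordId, count) rows, estimating the word probabilities, and classifying test emails. It is only an illustration, not the code used in this post. The names prob_spam, prob_token_spam and prob_token_nonspam follow the text; the helper functions (build_dictionary, to_triples, train, classify) and the toy emails are made up for the example. Two choices are my own assumptions rather than something stated above: Laplace (add-one) smoothing, so that unseen words do not get zero probability, and summing log-probabilities instead of multiplying raw probabilities, to avoid numeric underflow.

```python
import math
from collections import Counter


def build_dictionary(emails, size):
    """Return the `size` most frequent words across all emails."""
    counts = Counter(word for text in emails for word in text.split())
    return [word for word, _ in counts.most_common(size)]


def to_triples(emails, dictionary):
    """Encode emails as (doc_id, word_id, count) rows; both ids are 1-based."""
    word_id = {word: i for i, word in enumerate(dictionary, start=1)}
    rows = []
    for doc_id, text in enumerate(emails, start=1):
        # Words that are not in the dictionary are ignored.
        counts = Counter(w for w in text.split() if w in word_id)
        for word in sorted(counts, key=word_id.get):
            rows.append((doc_id, word_id[word], counts[word]))
    return rows


def train(rows, labels, vocab_size):
    """Estimate the model; labels[doc_id - 1] is 1 for spam, 0 for nonspam."""
    # Index 0 is unused because word ids start at 1.
    spam_counts = [0] * (vocab_size + 1)
    nonspam_counts = [0] * (vocab_size + 1)
    for doc_id, wid, n in rows:
        if labels[doc_id - 1] == 1:
            spam_counts[wid] += n
        else:
            nonspam_counts[wid] += n
    spam_total = sum(spam_counts)
    nonspam_total = sum(nonspam_counts)
    # Assumption: Laplace (add-one) smoothing over the vocabulary.
    prob_token_spam = [(c + 1) / (spam_total + vocab_size) for c in spam_counts]
    prob_token_nonspam = [(c + 1) / (nonspam_total + vocab_size) for c in nonspam_counts]
    # Fraction of training emails that are spam (assumes both classes are present).
    prob_spam = sum(labels) / len(labels)
    return prob_spam, prob_token_spam, prob_token_nonspam


def classify(rows, num_docs, model):
    """Return one boolean per email: True means the email is predicted to be spam."""
    prob_spam, prob_token_spam, prob_token_nonspam = model
    log_spam = [math.log(prob_spam)] * num_docs
    log_nonspam = [math.log(1 - prob_spam)] * num_docs
    for doc_id, wid, n in rows:
        # Assumption: sum log-probabilities instead of multiplying probabilities.
        log_spam[doc_id - 1] += n * math.log(prob_token_spam[wid])
        log_nonspam[doc_id - 1] += n * math.log(prob_token_nonspam[wid])
    return [s > h for s, h in zip(log_spam, log_nonspam)]


# Toy data, made up purely for illustration.
train_emails = [
    "win money now win",
    "cheap money offer",
    "meeting agenda attached",
    "project meeting notes",
]
train_labels = [1, 1, 0, 0]
test_emails = ["win cheap money", "project agenda"]

dictionary = build_dictionary(train_emails, 10)
model = train(to_triples(train_emails, dictionary), train_labels, len(dictionary))
test_matrix = to_triples(test_emails, dictionary)
predictions = classify(test_matrix, len(test_emails), model)
print(predictions)
```

With real data, the dictionary size would be 2500 and the emails would be read from the train and test folders rather than from hard-coded strings.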
