Detecting a simple neural network architecture using NLP for email classification

Hyperparameter optimization in email classification.

Tannistha Maiti, Apr 19

About a decade ago, spam brought email to near-ruin.

By 2015, Google said that its spam rate was down to 0.1 percent, and its false positive rate had dipped to 0.05 percent.

The significant drop is due in large part to the introduction of neural networks into its spam filters, which can learn to recognize junk mail and phishing messages by analyzing scads of the stuff across an enormous collection.

Neural networks are powerful machine learning algorithms.

They can be used to transform the features so as to form fairly complex nonlinear decision boundaries.

They are primarily used for classification problems.

The fully connected layers take the deep representation from the CNN or RNN/LSTM and transform it into the final output classes or class scores.

This post explains the classification of emails into ham and spam based on the following factors.

1. The use of a word embedding method on email subject lines.

2. Hyperparameter tuning of the number of layers, the number of nodes per layer, and the optimizer, on email subject lines.

3. A comparison of the performance of CNN and RNN methods on email subject lines.

Spam filters:

A business can choose from the different types of available spam filters.

Some filters target the content of emails to determine whether they are relevant to the business, whereas others check the email headers of the messages.

There are also some filters that you can set to restrict the acceptance of emails from specific addresses, as well as some that allow you to set parameters for what kinds of emails need to be blocked.

There are even some that allow you to stop any email that comes from addresses that are on a list of blacklisted spammers.

In this study spam filters are developed based on the subject line of the emails.

Example of email subject lines used in this study

Cardinality of dataset:

The dataset is based on the cleaned Enron corpus, which has a total of 92188 messages belonging to 158 users, with an average of 757 messages per user.

The dataset has almost an equal distribution of ham and spam emails.

A total of 19997 emails, both ham and spam, are used as the training and validation set, with a split of 20%.

17880 emails are then used as a test set to identify accuracy and false positives.

Neural Network Architecture used for hyper parameter tuning.

**Hyperparameter tuning of the number of layers and the number of nodes per layer

Artificial neural networks have two main hyperparameters that control the architecture or topology of the network: (a) the number of layers and (b) the number of nodes in each hidden layer.

The most reliable way to configure these hyperparameters for a specific predictive modeling problem is via systematic experimentation with robust testing.

A Multilayer Perceptron is built from nodes. A node, also called a neuron or perceptron, is a computational unit that has one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection.

Nodes are then organized into layers to comprise a network.

The types of layers in an MLP, and the terms used to describe its shape, are as follows:

1. Input Layer: Input variables, sometimes called the visible layer.

2. Hidden Layers: Layers of nodes between the input and output layers. There may be one or more of these layers.

3. Output Layer: A layer of nodes that produce the output variables.

4. Size: The number of nodes in the model.

5. Width: The number of nodes in a specific layer.

6. Depth: The number of layers in a neural network.

7. Architecture: The specific arrangement of the layers and nodes in the network.
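The shape terms above can be made concrete with a small sketch. The helper `describe_mlp` and the example layer widths are illustrative and not taken from the study; the sketch also assumes the common convention that the input layer is not counted toward depth or size.

```python
def describe_mlp(layer_widths):
    """Summarize an MLP given the node count of every layer, input first."""
    # Assumption: the input layer is not counted toward depth or size.
    hidden_and_output = layer_widths[1:]
    # Each dense layer has (inputs * outputs) weights plus one bias per output.
    parameters = sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:])
    )
    return {
        "depth": len(hidden_and_output),
        "widths": hidden_and_output,
        "size": sum(hidden_and_output),
        "parameters": parameters,
    }


# Hypothetical network: 100 inputs, hidden layers of 64 and 16 nodes, 1 output.
summary = describe_mlp([100, 64, 16, 1])
print(summary)
```

For this made-up network the depth is 3, the size is 81 nodes and there are 7521 weights and biases, which shows how quickly the parameter count grows with width.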

Detecting good hyperparameters is essential for three reasons: (a) to find a simple neural network architecture for the particular problem, (b) to find the right set of parameters from countless combinations of hyperparameters, and (c) to replicate the neural network architecture for future use.

Table 1 gives details of the three pre-trained deep neural net (DNN) models (bi-directional LSTM, no-embedding CNN and embedding CNN). The dense hidden layer contains 16, 32, 64 or 128 neurons, and the first, second and third layers contain 16, 32, 64 or 128 filters.
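One way to enumerate such a search space is a plain Cartesian product. The sketch below is a guess at a grid that reproduces the 3072 CNN combinations mentioned in the footnote, not the author's confirmed search space: the filter and neuron values come from the text, while the three depths, the count of four optimizers and the `SGD` placeholder are assumptions.

```python
from itertools import product

# Values taken from the text: 16, 32, 64 and 128 filters or neurons.
SIZES = [16, 32, 64, 128]
# Assumptions: three possible depths and four optimizers. Only RMSprop, Adam
# and Nadam are named in the post; 'SGD' is a placeholder for the fourth.
LAYERS = [1, 2, 3]
OPTIMIZERS = ["RMSprop", "Adam", "Nadam", "SGD"]


def cnn_grid():
    """Yield one keyword-argument dict per hyperparameter combination."""
    for fl1, fl2, fl3, dl, layer, optimizer in product(
        SIZES, SIZES, SIZES, SIZES, LAYERS, OPTIMIZERS
    ):
        yield {"fl1": fl1, "fl2": fl2, "fl3": fl3, "dl": dl,
               "layer": layer, "optimizer": optimizer}


grid = list(cnn_grid())
print(len(grid))  # 4 * 4 * 4 * 4 * 3 * 4 = 3072
```

Each dict has the same keys as the keyword arguments of the model-building functions in this post. Note that a full product like this repeats architectures: with one layer, for example, the values of `fl2` and `fl3` have no effect.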

The activation is ReLU in all layers except for the final dense (output) layer, where the activation is ‘tanh’.

The codes for CNN and Bi-directional LSTM network with embedding are presented here.

All models use a batch size of 16 and are trained for 2 epochs.
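To put these settings in perspective, the arithmetic below estimates how many weight updates each model receives. It is a sketch under two assumptions: that the 20% split means 20% of the 19997 emails are held out for validation, and that the training count is rounded down. The function name `training_steps` is made up for illustration.

```python
import math


def training_steps(n_samples, validation_split, batch_size, epochs):
    """Return (train size, validation size, total weight updates)."""
    # Assumption: the validation share is removed first and the training
    # count is rounded down.
    n_train = int(n_samples * (1 - validation_split))
    n_val = n_samples - n_train
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch * epochs


n_train, n_val, updates = training_steps(19997, 0.2, 16, 2)
print(n_train, n_val, updates)  # 15997 4000 2000
```

Under these assumptions each model sees about 16k training emails and makes 2000 gradient updates, which is a short training run.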

from keras.layers import Input, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense
from keras.models import Model

# MAX_SEQUENCE_LENGTH and embedding_layer are defined elsewhere in the study's code.
def embeddings(fl1=32, fl2=32, fl3=64, dl=16, optimizer='RMSprop', kl=5, layer=1):
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    # First convolution block, used by every variant.
    x = Conv1D(filters=fl1, kernel_size=kl, activation='relu')(embedded_sequences)
    x = MaxPooling1D(pool_size=kl)(x)
    if layer != 1:
        # Second convolution block.
        x = Conv1D(filters=fl2, kernel_size=kl, activation='relu')(x)
        x = MaxPooling1D(pool_size=kl)(x)
    if layer not in (1, 2):
        # Third convolution; the global pooling below follows it directly.
        x = Conv1D(filters=fl3, kernel_size=kl, activation='relu')(x)
    x = GlobalMaxPooling1D()(x)
    x = Dense(units=dl, activation='relu')(x)
    # 'tanh' is the output activation used in this study; 'sigmoid' is the
    # usual pairing with binary_crossentropy.
    preds = Dense(1, activation='tanh')(x)
    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['acc'])
    return model

Two important techniques in neural networks are dropout and activation functions.

Hyperparameter tuning can also be done over these techniques, but that is not done here.

Dropout technique is used to improve the generalization error of large neural networks.

In this method, noise zeros out (drops) a fixed fraction of the activations of the neurons in a given layer.

The rectified linear unit (ReLU) uses the activation function max(0, x).

ReLUs are incorporated into a standard feed-forward neural net, maintaining the probabilistic model with max(0, x).
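Both techniques fit in a few lines of plain Python. This is a minimal sketch rather than the Keras implementation: it uses the common "inverted dropout" formulation, in which each activation is dropped independently with probability `rate`, so the dropped fraction matches the rate only on average. The function names are illustrative.

```python
import random


def relu(values):
    """Apply max(0, x) element-wise."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng):
    """Inverted dropout: zero each activation with probability `rate` and
    scale the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]

# rate=0.1 mirrors the Dropout(0.1) layer used in the LSTM model of this post.
dropped = dropout(activations, rate=0.1, rng=random.Random(0))
print(dropped)
```

Dropout is applied only during training; at prediction time the activations pass through unchanged.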

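from keras.layers import Input, Bidirectional, LSTM, GlobalMaxPool1D, Dense, Dropout
from keras.models import Model

# MAX_SEQUENCE_LENGTH and embedding_layer are defined elsewhere in the study's code.
# fl2, fl3, kl and layer are unused; they keep the signature parallel to embeddings().
def embedding_LSTM(fl1=16, fl2=16, fl3=16, dl=16, optimizer='RMSprop', kl=5, layer=1):
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    # One bi-directional LSTM layer with fl1 units in each direction.
    x = Bidirectional(LSTM(units=fl1, return_sequences=True))(embedded_sequences)
    x = GlobalMaxPool1D()(x)
    x = Dense(units=dl, activation="relu")(x)
    x = Dropout(0.1)(x)
    # 'tanh' is the output activation used in this study; 'sigmoid' is the
    # usual pairing with binary_crossentropy.
    preds = Dense(1, activation='tanh')(x)
    model = Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['acc'])
    return model

Embedding:

GloVe (Global Vectors), used here, is one such approach, in which each word is mapped to a 100-dimensional vector.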

These vectors can be used to learn the semantics of words, such as Man is to Woman as King is to Queen, or Man + Female = Woman.
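The analogy arithmetic can be illustrated with a toy example. The three-dimensional vectors below are hand-made so that the arithmetic works out exactly; they are not real GloVe values, and the function names are illustrative.

```python
import math

# Hand-made 3-d toy vectors (axes: person, royal, female), not real GloVe values.
TOY_VECTORS = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 1.0],
    "boy":   [0.5, 0.0, 0.0],
    "girl":  [0.5, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' by ranking words near b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = analogy("man", "woman", "king", TOY_VECTORS)
print(answer)  # queen
```

With real 100-dimensional GloVe vectors the same ranking is done over the whole vocabulary, and the match is approximate rather than exact.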

This embedding plays an important role in many applications.

It is a kind of transfer learning, where word embeddings are learnt from a large corpus of data and can then be used on smaller datasets.

The vectors are generated by an unsupervised learning algorithm.

Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase linear substructures of the word vector space.

The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning.
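The ratio idea can be seen on a tiny made-up corpus. The sketch below counts co-occurrences within a window and compares P(context | word) for two words; the sentences, the window size and the function names are assumptions for illustration, and GloVe itself goes further by fitting vectors to these statistics.

```python
from collections import Counter, defaultdict


def cooccurrence(sentences, window=2):
    """Count how often each context word appears within `window` tokens of a word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def probability(counts, word, context):
    """P(context | word): share of word's co-occurrences that involve context."""
    total = sum(counts[word].values())
    return counts[word][context] / total if total else 0.0


def ratio(counts, context, word_a, word_b):
    """P(context | word_a) / P(context | word_b)."""
    p_b = probability(counts, word_b, context)
    return probability(counts, word_a, context) / p_b if p_b else float("inf")


# Made-up subject lines, for illustration only.
corpus = [
    "free offer today", "free offer inside", "meeting agenda today",
    "meeting agenda attached", "free meeting agenda", "meeting offer today",
]
counts = cooccurrence(corpus)
offer_ratio = ratio(counts, "offer", "free", "meeting")
agenda_ratio = ratio(counts, "agenda", "free", "meeting")
print(round(offer_ratio, 2), round(agenda_ratio, 2))
```

A ratio well above 1 means the context word is tied to the first word ("offer" with "free"), and a ratio well below 1 means it is tied to the second ("agenda" with "meeting").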

The GloVe data set has 400000 words.

The Bag-of-words method is used as feature vectors.

An analysis of the tokens (words) reveals that 24.80% of the vocabulary and 43.05% of all text are present in the embedding matrix.
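Coverage figures of this kind can be computed as sketched below. Vocabulary coverage counts each distinct token once, while text coverage weights tokens by how often they occur. The whitespace tokenizer, the sample subject lines and the small embedding vocabulary are assumptions for illustration, not the study's preprocessing.

```python
from collections import Counter


def embedding_coverage(subject_lines, embedding_vocab):
    """Return (share of distinct tokens, share of all tokens) found in the vocabulary."""
    token_counts = Counter(
        token for line in subject_lines for token in line.lower().split()
    )
    if not token_counts:
        return 0.0, 0.0
    known = [t for t in token_counts if t in embedding_vocab]
    vocab_coverage = len(known) / len(token_counts)
    text_coverage = sum(token_counts[t] for t in known) / sum(token_counts.values())
    return vocab_coverage, text_coverage


# Made-up inputs, for illustration only.
lines = ["free offer now", "meeting agenda now", "re: fwd enron"]
vocab = {"free", "offer", "now", "meeting"}
vocab_cov, text_cov = embedding_coverage(lines, vocab)
print(f"{vocab_cov:.2%} of vocabulary, {text_cov:.2%} of text")
```

Text coverage is usually the higher of the two because frequent words are more likely to have an embedding than rare ones.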

Results and Analysis:

Figure 1. Results of hyperparameters vs validation loss in CNN models.

The results in Figure 1 show that the number of neurons in the dense layer is not very conclusive, but 64 filters in the first layer give a low validation loss.

It is also difficult to relate the number of filters in the second and third layers to a low validation loss.

Figure 2. 3D visualization of hyperparameters based on validation loss, trainable parameters and dense layers in CNN models.

Figure 2 gives a good visualization of simple (low trainable parameters) and complex models (high trainable parameters).

In the lower left-hand corner of the parameter space, single-layer models with 20k to 70k trainable parameters give low validation loss and are good architectures.
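The trainable parameter counts can be estimated directly from the layer sizes. The sketch below uses the standard formulas for Conv1D and dense layers and assumes a frozen 100-dimensional embedding and a kernel size of 5; the function names are illustrative, and the printed counts are computed from the formulas, not read from the study's tables.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(n_in, n_out):
    """Weights plus one bias per output node."""
    return n_in * n_out + n_out


def cnn_trainable_params(filters, dl, kl=5, embedding_dim=100):
    """Trainable parameters of a Conv1D stack, a dense layer of `dl` units and
    a single output unit. Assumes the embedding layer is frozen."""
    total, channels = 0, embedding_dim
    for f in filters:
        total += conv1d_params(channels, f, kl)
        channels = f
    # Pooling layers add no parameters; global max pooling leaves `channels` values.
    total += dense_params(channels, dl) + dense_params(dl, 1)
    return total


single_layer = {fl1: cnn_trainable_params([fl1], dl=16) for fl1 in (16, 32, 64, 128)}
print(single_layer)
```

Under these assumptions, single-layer models with 64 or 128 filters and 16 dense neurons have 33121 and 66209 trainable parameters, which falls inside the 20k to 70k range.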

Figure 3. Results of hyperparameters vs validation loss in bi-directional LSTM models.

The results in Figure 3 show that 64 neurons in the dense layer is a good choice, and that 128 filters in the first layer also give acceptable results with low validation loss.

However, for the bi-directional LSTM, the more trainable parameters a model has, the lower its validation loss.

Figure 4. 3D visualization of hyperparameters based on validation loss, trainable parameters and dense layers in bi-directional LSTM models.

Figure 4 gives a good visualization of simple (low trainable parameters) and complex models (high trainable parameters) in bi-directional LSTM models.

At least 64 or 128 dense-layer neurons are necessary to generate results with a low validation loss.

Also, models with more trainable parameters have a low validation loss and are good architectures.

In comparison to the CNN models, the validation loss is lower for LSTMs.

Table 2: Detailed results of top 5 models from three different types of architecture.

Qualitative analysis of Table 2 shows the results from the CNN and bidirectional LSTM networks with and without Global vector embedding.

The five best models, ranked by lowest validation loss, are chosen.

The maximum accuracy is 58%, which is generated by a CNN with embedding.

The best accuracy, together with the highest sensitivity of 73%, is for a CNN model with 66226 trainable parameters.

The average accuracy is 50%, with the no-embedding CNN models having an accuracy lower than 50%.

The sensitivity of the models is more than 50% with a high of 73%.

It should be noted that the models with no-embedding have very low sensitivity.

The models with no-embedding have high true negative rates.
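For reference, the metrics discussed here can all be derived from the four confusion-matrix counts. The sketch below assumes spam is the positive class (label 1); the labels and predictions are made up to show the calculation and are not results from the study.

```python
def _safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class is absent."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, true negative rate and false positive rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": _safe_div(tp + tn, len(pairs)),
        "sensitivity": _safe_div(tp, tp + fn),          # true positive rate
        "true_negative_rate": _safe_div(tn, tn + fp),   # specificity
        "false_positive_rate": _safe_div(fp, fp + tn),
    }


# Made-up labels and predictions; spam is 1, ham is 0.
metrics = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 1, 1])
print(metrics)
```

A model that labels almost everything as ham has a high true negative rate and a very low sensitivity, which is the pattern described above for the no-embedding models.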

Also, the Nadam optimizer works best for this kind of problem.

Conclusion:

Referring back to the questions:

The use of the word embedding method on email subject lines proved better than no embedding.

An extensive search was made for the architectures that generate low validation losses.

LSTM (RNN) models with more trainable parameters give the lowest validation losses overall.

Simple one-layer CNN models generate a slightly higher validation loss.

CNN models with embedding have the highest accuracies and true positive rates.

Future studies will include methods to deal with out-of-vocabulary words and the use of a character-based model such as the lm_1b model designed by the Google Brain team, which includes 256 vectors (including 52 characters and special characters) with a dimension of just 16.
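A character-level lookup of that size could be sketched as follows. This is not the lm_1b implementation: it only uses the two sizes quoted above (256 vectors of dimension 16), indexes the table by UTF-8 byte value, and fills it with random numbers where a real model would learn the values. The names are illustrative.

```python
import random

NUM_SYMBOLS = 256    # one row per possible byte value
EMBEDDING_DIM = 16   # dimension quoted in the text


def make_table(seed=0):
    """Build a 256 x 16 table of small random values (a trained model would learn these)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
            for _ in range(NUM_SYMBOLS)]


def embed_word(word, table):
    """Map a word to one 16-d vector per UTF-8 byte; no word vocabulary is needed."""
    return [table[b] for b in word.encode("utf-8")]


table = make_table()
vectors = embed_word("v1agra", table)
print(len(vectors), len(vectors[0]))  # 6 16
```

Because every string can be broken into bytes, a misspelled or obfuscated word still gets a representation, which is what makes this approach attractive for out-of-vocabulary words.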

I welcome feedback and constructive criticism.

I can be reached through LinkedIn.

The code to this study is found here.

___________________________________________________________________

**The total number of hyperparameter combinations in CNN models is 3072, but only the models with the Adam and Nadam optimizers are considered, since they generate the least validation loss.

Similarly, only the Adam and Nadam optimizers are considered for LSTM.
