Roadmap for multi-class sentiment analysis with deep learning

Aman Deep, Jan 19

A practical guide to create incrementally better models

Sentiment analysis quickly gets difficult as we increase the number of classes.

For this blog, we’ll have a look at what difficulties you might face and how to get around them when you try to solve such a problem.

Instead of prioritizing theoretical rigor, I’ll focus on how to practically apply some ideas on a toy dataset and how to edge yourself out of a rut.

I’ll be using Keras throughout.

As a disclaimer, I’d say it’s unwise to throw the most powerful model at your problem at first glance.

Traditional natural language processing methods work surprisingly well on most problems and your initial analysis of the dataset can be built upon with deep learning.

However, this blog aims to be a refresher for deep learning techniques exclusively and an implementational baseline or a general flowchart for hackathons or competitions.

Theory throughout this post will either be oversimplified or absent, to avoid losing the attention of the casual reader.

The problem

We’ll analyze a fairly simple dataset I recently came across, which can be downloaded from here.

About 50 thousand people were asked to respond to a single question, “What is one recent incident that made you happy?” Their responses were tabulated and their reason for happiness was categorized into seven broad classes like ‘affection’, ‘bonding’, ‘leisure’, etc.

Additionally, we also know whether the incident happened within 24 hours of the interview or not.

This problem is quite different from your regular positive/negative classification because, even though there are seven classes, all the responses are inherently happy, and differentiating between them might be quite difficult even for humans.

Before we start, this is where you’ll find the complete notebook for this blog as well as all the discussed architectures in separate files if you want to tinker with them yourself.

You are free to use whatever you find there, however you like, no strings attached.

This blog visits these topics:

Preprocessing
Sentence embedding using tf-idf
Generating word embeddings from a corpus using gensim and keras
Using pretrained word-vectors like GloVe and GoogleNews-Word2Vec
Analyzing causes of flatlining
Dealing with class imbalance
Oversampling, undersampling, cost-sensitive learning
Using alternate activation functions
ELMo sentence embeddings
Ensembling

The dataset

Let’s see what we’re working with.

Here’s what each column means:

id is just a unique id for each sentence
period is the period during which the interviewee had their experience, which can be either during the last 24 hours (24h) or the last 3 months (3m)
response is the response of the interviewee and the most important independent variable
n is the number of sentences in the response, and
sentiment is our target variable

Pre-processing

To keep the first model simple, we’ll go ahead and drop the n column.

We'll see soon that it doesn't matter anyway.

We'll also drop the id column because that's just a random number... or is it?

Assuming anything about the data beforehand will almost always mislead our model.

For example, it might be possible that while collecting the data, the ids were assigned serially and it just so happened that every fifth observation was taken in a park full of people, where the predominant cause of happiness was exercise or nature.

This is probably useless in the real world, but insights like these might win you a hackathon.

We'll keep it to track if our shuffles are working correctly but we won't be using it for training our models.

And we'll obviously drop the sentiment column as it is the target variable.

df.drop(['n', 'sentiment'], axis=1, inplace=True)

With these problems, the classes are usually not balanced, but we’ll worry about that later.

First, we want to get a simple model up and running to compare our future models with.

Let’s quickly convert our categories into one-hot arrays so that they look like this.
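The encoded output is not reproduced here, so here is a minimal sketch of what one-hot encoding does, written without any third-party library. The seven class names are the ones that appear in this post; the three sample labels are made up, and the author's own conversion (in the notebook) may well use a pandas or keras helper instead.

```python
# Sketch: one-hot encode string labels using only the standard library.
# The class names are the seven categories named in this post.
classes = sorted(["achievement", "affection", "bonding", "enjoy_the_moment",
                  "exercise", "leisure", "nature"])
class_to_index = {name: i for i, name in enumerate(classes)}

def one_hot(label):
    """Return a list of zeros with a single 1 at the label's index."""
    vector = [0] * len(classes)
    vector[class_to_index[label]] = 1
    return vector

sample_labels = ["affection", "nature", "achievement"]  # made-up sample
y = [one_hot(label) for label in sample_labels]
for label, row in zip(sample_labels, y):
    print(label, row)
```

Each row has exactly one 1, and its position is the index of the class in the sorted class list.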

Next, we convert the response column to lowercase.

df.response = df.response.apply(str.lower)

All the steps up to here are dataset independent.

We would have to go through the same preprocessing steps for our test set as well as all the other models we’ll try, regardless of architecture.

Post-processing

Our first few models will follow the traditional approach of doing a lot of work ourselves and gradually move on to higher and higher levels of abstraction.

However, the preprocessing step will be common across all pipelines.

Neural networks cannot process strings, let alone strings of arbitrary size, so we first lowercase each sentence and split it at punctuation marks and spaces.

This is called tokenization (well…it’s a bit more complicated than what I just said).

  We’ll use the word_tokenize function from nltk to help us with this.

import nltk

def tokenize(df):
    # Adds a 'tokens' column holding the list of tokens for each response
    df['tokens'] = df['response'].map(lambda x: nltk.word_tokenize(x))

tokenize(df)

Stopwords are words that appear way too frequently in the English language to be actually meaningful, like ‘a’, ‘an’, ‘the’, ‘there’, etc.

nltk.corpus has a handy stopwords module that enumerates these.

We could remove stopwords during tokenization, but I decided against it as it might affect the context.

The stopword corpus includes 'not', a negation that can flip the emotion of the passage.

Moreover, phrases like 'To be or not to be' would be entirely removed.
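To see the problem concretely, here is a small sketch of stopword removal. The STOPWORDS set below is a hand-picked illustrative subset (an assumption for this example), not the actual nltk list, which is much longer.

```python
# Illustrative subset only; nltk's English stopword list is much longer.
STOPWORDS = {"a", "an", "the", "there", "to", "be", "or", "not", "is", "was"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))   # []
print(remove_stopwords(["the", "movie", "was", "not", "good"]))  # ['movie', 'good']
```

The second call shows the negation problem: "not good" silently turns into "good".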

We could make our own corpus of stopwords, but the performance would hardly improve as our dataset is pretty small already.

So we drop the idea and move on.

Once we have the tokens, we don’t need the original responses, because our model can’t make any sense of them anyway.

It’s a great time now to separate a part of the training set into the validation set, to make sure we aren’t cheating.

As the data is unstructured, a random shuffle will work just fine.

df_train, df_val, y_train, y_val = train_test_split(df, y, test_size=0.15, random_state=42)

Remove the random-seed parameter (random_state) if you want a new permutation every run.

These are the shapes of df_train, y_train, df_val and y_val respectively.

(46172, 3) (46172, 7)
(8149, 3) (8149, 7)

Embeddings

There is just one more problem.

Neural networks work on strictly numerical data and still can’t make sense of the tokens in our dataset.

We need to find a way to represent each word as a vector, somehow.

 Let’s take a little detour.

 Suppose we want to differentiate between pop and metal.

What are some properties we can use to describe these genres? Let’s use percussion, electric guitar, acoustic guitar, synth, happiness, sadness, anger and complexity as the features to describe each genre.

The vector for pop might look something like

(0.5, 0.2, 0.5, 1.0, 0.8, 0.5, 0.2, 0.3)

and the one for metal might look like

(0.9, 0.9, 0.3, 0.1, 0.4, 0.5, 0.8, 0.7)

So if we want to classify heavy-metal, its vector might be

(1.0, 1.0, 0.0, 0.1, 0.1, 0.5, 1.0, 0.9)

These vectors can be plotted in an 8-dimensional space, and the euclidean distance (np.linalg.norm) between metal and heavy-metal (0.529) will be smaller than the euclidean distance between pop and metal (1.476), for example.
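If you want to check those two distances yourself, here is a small sketch using only the standard library (np.linalg.norm of the difference gives the same numbers). The helper name euclidean is mine.

```python
import math

# Feature order: percussion, electric guitar, acoustic guitar, synth,
# happiness, sadness, anger, complexity
pop = (0.5, 0.2, 0.5, 1.0, 0.8, 0.5, 0.2, 0.3)
metal = (0.9, 0.9, 0.3, 0.1, 0.4, 0.5, 0.8, 0.7)
heavy_metal = (1.0, 1.0, 0.0, 0.1, 0.1, 0.5, 1.0, 0.9)

def euclidean(a, b):
    """Straight-line distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(euclidean(metal, heavy_metal), 3))  # 0.529
print(round(euclidean(pop, metal), 3))          # 1.476
```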

 Similarly, we can encode every single word in our corpus in some way, to form a vector.

We have algorithms that can train a model to generate an n-dimensional vector for each word.

We have no way of interpreting (that I know of) what features were selected or what the numbers in the vectors actually mean, but we'll see that they work anyway and similar words huddle up together.

 gensim provides a handy tool that can train a set of embeddings according to your corpus, but we have to 'Tag' them first as the model accepts a vector of TaggedDocument objects.

A tagged vector looks like this.

In [1]: vector_train_corpus[1]
Out[1]: TaggedDocument(words=['my', 'friend', 'came', 'over', 'to', 'watch', 'critical', 'role', '.'], tags=['TRAIN_0_1'])

The Word2Vec module can train a dictionary of embeddings, given a vector of TaggedDocument objects.

Let’s see if our embeddings are any good.

In [2]: embeddings.wv.most_similar('exercise')
Out [2]: [('weights', 0.8415981531143188), ('diet', 0.823296308517456), ('cash', 0.8081855773925781), ('overtime', 0.8048149943351746), ('savings', 0.7981317639350891), ('routine', 0.7969541549682617), ('exercising', 0.7916312217712402), ('surveys', 0.7907524108886719), ('survey', 0.7893239259719849), ('workout', 0.788111686706543)]

We learnt some good correlations to ‘exercise’ like ‘diet’ and ‘workout’ but the rest aren’t good enough.

Anyway, this will do for now.

Visualizing the embeddings

We cannot directly visualize high-dimensional data.

To see if our embeddings actually carry useful information, we need to reduce the dimensionality to 2 somehow.

There are two extremely useful techniques, PCA (principal component analysis) and t-SNE (t-distributed stochastic neighbor embedding), that do just this: they flatten high-dimensional data into the best possible representation in the specified number of lower dimensions.

  t-SNE is a probabilistic method and takes a while to run, but we’ll try both methods for the 2000 most common words in our embeddings.

PCA

(Code listing: To generate a dataframe with reduced dimensions)

Bokeh is an extremely useful library for interactive plots which has flown under the radar of quite a lot of people for a long time.

If you run the jupyter notebook locally, you can interact with the plot, hover around and see which word each dot represents.

(Code listing: To generate an interactive plot for vectors generated from PCA)

(Figure: Scatter plot for word-embeddings after applying PCA)

t-SNE

(Code listing: Plot t-SNE using bokeh)

Numbers as well as number names are grouped up together.

t-SNE usually does a better job showing more separated clusters, while PCA just bunched everything up in the middle in this example.

However, performance is dataset dependent and it never hurts to try both.

Dense networks

For our first model, we’ll try a very common approach to binary sentiment classification, for which we first need to calculate the tf-idf score of each word in our corpus.

Tf-idf stands for 'Term frequency – inverse document frequency'.

If you haven't heard of it, all it does is assign a weight to each word based on the frequency of its appearance in a corpus.

Words that appear often, like 'the', 'when' and 'very' will have a low score and the rarer ones, like 'tremendous', 'undergraduate' and 'publication', which might actually help us classify a sentence, will have a higher score.

This is a simple heuristic in order to better understand our data.

It is corpus specific and we can train one for the embedding vectors we generated.

The TfidfVectorizer class from sklearn makes quick work of it and we can fit one to our vectors as follows.

gen_tfidf = TfidfVectorizer(analyzer=lambda x: x, min_df=3)
matrix = gen_tfidf.fit_transform([sentence.words for sentence in vector_train_corpus])
tfidf_map = dict(zip(gen_tfidf.get_feature_names(), gen_tfidf.idf_))

(In recent scikit-learn releases, get_feature_names has been replaced by get_feature_names_out.)

The min_df parameter is a threshold for the minimum document frequency.

In this case, we do not want to track the tf-idf score of a word that appears in fewer than three responses in our corpus.
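To make the idf part and the min_df cut-off less abstract, here is a dependency-free sketch. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which is scikit-learn's default; the function name smooth_idf and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter

def smooth_idf(documents, min_df=1):
    """documents: list of token lists. Returns {token: idf} for tokens that
    appear in at least min_df documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    return {
        token: math.log((1 + n_docs) / (1 + df)) + 1
        for token, df in doc_freq.items()
        if df >= min_df
    }

docs = [["i", "went", "out"], ["i", "ate", "out"], ["i", "slept"]]  # made-up
idf = smooth_idf(docs, min_df=2)
print(idf)
```

Here "i" appears in every document and gets the lowest possible score, "out" scores a bit higher, and the words seen in only one document are dropped by min_df=2.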

  Now, for every response object, we will create a vector of size 200 (the same dimension as our embedding vector).

This is our sentence-level embedding.

We will take the average of the embedding vectors of each token in each response and weight it by the tf-idf score of each word.

The embedding for the sentence "I went out for dinner" can be calculated as follows.

(Figure: Encoding a sentence weighted by the tf-idf score.)

The encode_sentence function adds up the vector of each token in a sentence, weighted by the tf-idf score and generates a vector of length 200 for each response.
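Here is a rough sketch of that idea in plain Python. It is not the encode_sentence function from the notebook: the name encode_sentence_sketch, its signature, the 2-dimensional vectors and the weights are all assumptions made for illustration, and I chose to skip tokens that have no embedding or no tf-idf score.

```python
def encode_sentence_sketch(tokens, embeddings, tfidf_map, size):
    """Average of token vectors, each scaled by its tf-idf weight.
    Tokens missing from either lookup are skipped."""
    total = [0.0] * size
    count = 0
    for token in tokens:
        if token in embeddings and token in tfidf_map:
            weight = tfidf_map[token]
            vector = embeddings[token]
            total = [t + weight * v for t, v in zip(total, vector)]
            count += 1
    if count == 0:
        return total  # no known tokens: return the zero vector
    return [t / count for t in total]

# Made-up 2-dimensional embeddings and weights, for illustration only.
embeddings = {"went": [1.0, 0.0], "dinner": [0.0, 1.0]}
tfidf_map = {"went": 2.0, "dinner": 4.0}
print(encode_sentence_sketch(["i", "went", "out", "for", "dinner"],
                             embeddings, tfidf_map, size=2))  # [1.0, 2.0]
```

The rarer word ("dinner", with the higher weight) pulls the sentence vector further in its own direction.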

(Code listing: To encode a sentence as explained)

x_train = scale(np.concatenate([encode_sentence(ele, 200) for ele in map(lambda x: x.words, vector_train_corpus)]))
x_val = scale(np.concatenate([encode_sentence(ele, 200) for ele in map(lambda x: x.words, vector_val_corpus)]))

Let’s build a simple two layer dense net.

This is just to check if we have done everything correctly up to this point.

Let’s call this our zero’th model.

Dense-net on sequential data without transformations is a joke anyway, right?

(Code listing: Single layer dense network)

In [30]: model.fit(x_train, y_train, epochs=10, verbose=1)
         score = model.evaluate(x_val, y_val, verbose=1)
         print(score)
Out [30]: [1.416706955510424, 0.4626334520158102]

We get a loss of 1.41 and a validation accuracy of 0.46.

This exact same model manages to get a validation score of about 0.8 on binary sentiment analysis, but given the difference in complexity, hopefully you weren’t expecting much.

   Throwing in another dense layer doesn’t help either.

(Code listing: Double layer dense network)

In [31]: model.fit(x_train, y_train, epochs=10, verbose=1)
         score = model.evaluate(x_val, y_val, verbose=1)
         print(score)
Out [31]: [1.4088160779845103, 0.4665603141855471]

Unsurprisingly, the results are still pretty bad, as dense layers cannot capture temporal correlations.

Recurrent networks

A recurrent network using LSTM or GRU cells will surely solve the problem, but upon reading the documentation of keras.layers.LSTM you'll realize it expects an input batch shape of (batch_size, timesteps, data_dim).

Obviously it would want some data along the dimension of time as well, but our encoded vectors have a shape of (batch_size, data_dim).

 For our case, timesteps refers to the tokens.

Instead of averaging out the vectors of each response, we want to keep them as they are.

To fit our RNN, we can create a new way of encoding our tokens.

We will ignore the tf-idf scores altogether and expect the LSTM to find out whatever useful features it needs for itself over the epochs.

  There is just one more problem.

LSTMs expect same-sized inputs for each sample, i.e. all the sentences must have exactly the same number of words, which we will call the sequence length.

 To see what we're working with, here's a scatter-plot of the distribution of token lengths in our training set.

lengths = [len(token) for token in df_train.tokens]
plt.scatter(lengths, range(len(lengths)), alpha=0.2);

In [39]: print(np.mean(lengths), np.max(lengths))
Out [39]: 20.543121372260245 1349

The longest response turned out to be 1349 words long, but the mean length was about 21 words.

You can do broadly two things here. You can set the sequence length equal to the number of words in the longest response you have found, but you don’t know how long the longest response in the test set might be, and you might have to truncate anyway. Or you can keep your sequence length close to the mean, but just large enough to not lose much data.

We’ll see better ways of handling long responses later.

Once we decide our sequence length, longer responses will be truncated and shorter responses will be padded with a vector of zeros (or a vector of the means along the transverse axis, but zeros work just fine).

 For now, I’ll use a sequence length of 80.

No specific reason.
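As a rough sketch of the truncate-or-pad step (not the notebook's encoding function), here is a plain-Python version. The name pad_or_truncate, the tiny dimensions and the choice to pad at the end are my assumptions; keras' pad_sequences, for instance, pads at the front by default.

```python
def pad_or_truncate(vectors, seq_len, dim):
    """Return exactly seq_len vectors: cut long inputs, zero-pad short ones."""
    clipped = list(vectors[:seq_len])
    padding = [[0.0] * dim for _ in range(seq_len - len(clipped))]
    return clipped + padding

short_response = [[1.0, 2.0], [3.0, 4.0]]             # 2 tokens, dim 2
long_response = [[float(i), 0.0] for i in range(5)]   # 5 tokens, dim 2

print(pad_or_truncate(short_response, seq_len=4, dim=2))
print(len(pad_or_truncate(long_response, seq_len=4, dim=2)))
```

With a real sequence length of 80 and 200-dimensional embeddings, every response becomes an 80 x 200 block, which is the (timesteps, data_dim) shape the LSTM wants per sample.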

(Code listing: Sentence encoding for LSTM)

We’re done here.

 Finally we can build our first recurrent neural network.

I’ll use the CuDNNLSTM class, which is astronomically faster than the LSTM class if you're on a GPU.

LSTM is so much slower that I don't have the patience to benchmark it for you.

Additionally, let's use the functional API of keras instead of the .add syntax for a change.

It is a lot more flexible.

This is our actual baseline model.

In [46]: model.fit(x_train, y_train, epochs=10, verbose=1)
         score = model.evaluate(x_val, y_val, verbose=1)
         print(score)
Out [46]: [0.571527066326153, 0.8551969567037445]

The loss now is 0.57 and the validation accuracy is 0.855, which is a great improvement, just as we expected.



Bidirectional

In the current state, our model can just remember the past.

It might benefit from a bit of context, maybe read a full phrase before sending an output to the next layer.

For example, “It was hilarious to see” and “It was hilarious to see how bad it was” mean very different things.

A bidirectional recurrent neural network (BRNN) overcomes this difficulty by propagating once in the forward direction and once in the backward direction and weighting them appropriately.

I don’t expect the score to increase much, as sentiment analysis doesn’t really need this structure.

Machine translation or handwriting recognition can make better use of bidirectional layers, but it never hurts to try.

In keras, you can just call Bidirectional with your existing layer.

However, Bidirectional LSTMs tend to overfit a bit, so I'll validate after each epoch, just to measure how much impact a bidirectional layer can potentially have.

It's a bit unfair to the previous models, but there won't be much improvement anyway.

In [50]: model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, verbose=1)

The best validation accuracy was 0.8640 at the end of epoch 5, a 1% improvement.

It’s not much, but we’ll take it.



Embedding

There is a slightly simpler way of doing this.

We can just add a keras Embedding layer and skip dealing with gensim altogether.

All the document-tagging, vector-building and training will be taken care of by keras.

We can skip tokenization as well, as the Tokenizer class in keras tokenizes everything in the way Embedding likes.

You can rerun this notebook up to the preprocessing section, so that your dataframe looks like this.

(Table: Make sure your dataframe has the following columns)

Shuffle the data

df_train, df_val, y_train, y_val = train_test_split(df, y, test_size=0.15, random_state=42)

(Code listing: To tokenize sentences)

These are our new tokens, which are obviously not all the same length, so we’ll quickly pad them with zeros.

pad_sequences is a handy function to do just this.

(Code listing: To zero-pad sequences)

We’ll be using this two layer RNN extensively to benchmark different approaches.

The Embedding layer takes in a vocabulary size, the length of each word-vector, the input sequence length and a boolean that tells it whether it should train itself.

We set this to False if we're using someone else's embeddings as they are, and to True if we're fine-tuning them (transfer learning) or training from scratch.

df_train.tokens returns a list, but we need a numpy array of numpy arrays as our training set.

x_train = np.array([np.array(token) for token in df_train.tokens])
x_train.shape

(46172, 80)

model.fit(x_train, y_train, epochs=10, verbose=1)

We have to carry out the same transformations on the validation set.

x_val = np.array([np.array(token) for token in df_val.tokens])
print(x_val.shape, y_val.shape)

(8149, 80) (8149, 7)

score = model.evaluate(x_val, y_val, verbose=1)
print(score)

[0.47125363304474144, 0.8988832985788697]

Our validation score is good.

With half the work, we managed to get a slightly better model than the previous one. Or is it because we have two LSTM layers this time? The influences are compounded and it might not work out so well for the test set.

 However, if you train your own embeddings on a dataset this small, you’re likely to not generalize well on the test set.

Your real world accuracy might plummet further if you plan to use that model in production.

 To prevent this, you need to train on a larger dataset, but the 6 million parameters will soon be 6 billion parameters.

Besides, it might not be easy to collect more data if you’re solving a problem for a company.

Pre-trained embeddingsLet’s face it.

Nobody trains their own embeddings nowadays, unless your model needs to understand domain-specific language.

If you take somebody’s model, tweak it and call it your own, you’ll have better results in less time.

Using pre-trained models is part of transfer learning, where you try to create a ripoff of a great model to suit your dataset.

More specifically, there are two very commonly used open source embeddings that will outperform self-trained embeddings 95 out of 100 times.

There’s nothing special about it, they’re just high dimensional vectors trained on huge datasets, on hardware more powerful than anything you’ll ever own.

They give the best results for most NLP tasks.

(Spoiler: No they don’t. Even better embeddings were released last year. We’ll get to that.)

GloVe

Global Vectors for word representation (GloVe) is a suite of pre-trained word embeddings. The version with a vocabulary of 400 thousand words was trained on 6 billion tokens.

These embeddings can be downloaded here. From here onwards, we will use the keras Embedding layer as it is easier to work with.

We'll also use the keras Tokenizer class as it works well with Embedding.

There is a major difference between keras.preprocessing.text.Tokenizer and nltk.word_tokenize, however.

Tokenizer returns a list of numbers, assigned according to frequency, instead of a list of words and internally maintains a vocabulary dictionary that maps words to numbers.
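To get a feel for what "numbers assigned according to frequency" means, here is a stripped-down sketch in plain Python. It only mimics the idea (rank 1 for the most frequent word, 0 left free for padding); the function names are hypothetical, and the real Tokenizer also filters punctuation and has more options.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next...
    Index 0 is left unused so it can serve as the padding value."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

texts = ["i went out for dinner", "i went home"]  # made-up responses
word_index = build_word_index(texts)
print(word_index)
print(texts_to_sequences(texts, word_index))  # [[1, 2, 3, 4, 5], [1, 2, 6]]
```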

Restart your kernel and rerun up to the preprocessing section if you're running out of memory.

Now is a good time to shuffle the dataset

df_train, df_val, y_train, y_val = train_test_split(df, y, test_size=0.15, random_state=42)

We’ll use gensim to generate a dictionary of embeddings from the downloaded data; however, the file you downloaded isn't in the format gensim likes.

Thankfully, there's a workaround for this by gensim themselves.

The glove2word2vec function converts the file into the word2vec format, which gensim can load.

We'll save this file in the same directory as the original.

(Code listing: To save the downloaded data in a format gensim can parse later)

We just want embeddings for words that are actually in our corpus.

Filter out the unwanted words and count the number of words that we don’t have embeddings for.
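The filtering step could look roughly like this. It is a sketch with plain lists and dicts rather than the notebook's code: build_embedding_matrix, word_index and pretrained are hypothetical stand-ins for the tokenizer's vocabulary and the downloaded vectors, and leaving missing words as all-zero rows is one choice among several.

```python
def build_embedding_matrix(word_index, pretrained, dim):
    """Row i holds the vector for the word with index i; row 0 stays all-zero
    (reserved for padding). Words without a pretrained vector also stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = 0
    for word, index in word_index.items():
        vector = pretrained.get(word)
        if vector is None:
            missing += 1
        else:
            matrix[index] = list(vector)
    return matrix, missing

# Made-up stand-ins for the tokenizer's vocabulary and the downloaded vectors.
word_index = {"happy": 1, "dinner": 2, "zxqv": 3}
pretrained = {"happy": [0.1, 0.2], "dinner": [0.3, 0.4], "unused": [0.9, 0.9]}
emb_matrix, n_missing = build_embedding_matrix(word_index, pretrained, dim=2)
print(emb_matrix, n_missing)
```

The pretrained word "unused" never makes it into the matrix, and "zxqv" is counted as missing.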

(Code listing: To generate an embedding matrix for words in our corpus)

We still don’t have everything we need.

For multi-class classification, tracking the accuracy is often misleading, especially if you have a class imbalance.

You can trivially get 90% accuracy on a dataset that has 90 positive samples and 10 negative samples by just predicting the mode, but the model will be pretty useless.

We should instead track the F1 score as well.

If you know what precision and recall are, you probably know what an f1-score is.

  Precision measures how many positive-predicted samples were actually positive.

 Recall measures how many actual positive samples were predicted to be positive.

  The F1 score is the harmonic mean of the two, which serves as a great metric for tracking your model’s progress.
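Here is the 90/10 example from above worked out in plain Python. This is only a sketch of the definitions, not the keras metric used later in the post; the helper name precision_recall_f1 is mine.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 90 positive and 10 negative samples; the "model" always predicts the mode.
y_true = ["pos"] * 90 + ["neg"] * 10
y_pred = ["pos"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy)                                    # 0.9
print(precision_recall_f1(y_true, y_pred, "neg"))  # (0.0, 0.0, 0.0)
print(precision_recall_f1(y_true, y_pred, "pos"))
```

Accuracy looks fine at 0.9, but the F1 score of the minority class is zero, which is exactly the failure accuracy hides.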

Unfortunately, the native f1-score metric of keras was removed in version 2.0, so we have to write our own.

Keras accuracy metrics expect vectors of target classes and predicted classes.

We can finally build our model using the Embedding class.

The weights will be initialized using the emb_matrix and trainable will be set to False.

Setting trainable to True usually gives slightly better results at the expense of ~6 million more trainable variables (corpus dependent).

Suit yourself.

Note: I will intentionally leave out GRUs throughout this notebook as LSTMs almost always work better in practice.

But you can try them out yourself.

Just replace LSTM with GRU, or CuDNNLSTM with CuDNNGRU if you're on a GPU.

We’ll use brnn_2 again.

The model definition will be the same, but the compile command will change to:

model.compile(optimizer=Adam(lr=1e-3), loss='categorical_crossentropy', metrics=['accuracy', f1])

model.fit will be called just like it was before.

Going through the same preprocessing pipeline for the validation set, we can evaluate our model.

In [31]: score = model.evaluate(x_val, y_val, verbose=1)
         print(score)
Out [31]: [0.5425475006672792, 0.8874708553342998, 0.887631649655906]

The validation score this time is 0.88 and the f1-score is very similar, but pre-trained embeddings will almost certainly generalize better to the test set or real world data, and handle anomalies more effectively.

Word2Vec

Google released their pre-trained Word2Vec embeddings a few years ago.

They were trained on part of the Google News corpus (about 100 billion words) and contain vectors for about 3 million words and phrases.

You can download the vectors here.

Split the dataset, tokenize, pad with zeros, etc until you get what we had for the previous model.

This time, the downloaded embedding file is good enough for gensim to import directly.

The model and everything else is exactly the same and we'll still be tracking the F1-score.

Train brnn_2 exactly like last time, preprocess the validation set and evaluate the model.

In [34]: score = model.evaluate(x_val, y_val, verbose=1)
         print(score)
Out [34]: [0.43286853496308475, 0.8793717020639599, 0.8801786997359521]

The validation score is 0.879 this time, which is a very small difference from the previous model, and we can’t objectively say which model is better.

Word2Vec is usually slightly better than GloVe on most NLP applications, but this time it wasn’t.

We’ve stalled

Over the last few models, our validation score has parked itself at about 0.88, which leads us to think: is this the best accuracy we can reach? Our training accuracies have almost always surpassed 96%; are we overfitting? Or are we underfitting? Maybe adding more layers interspersed with Dropout layers or other regularization will help?

For multi-class classification, if you have flatlined, the answers to these questions lie in the dataset.

This is where you should have a look at it.

Plot all charts that you think might be helpful and try to gain some insights.

Maybe plotting the confusion matrix for our last model will help.

In [36]: y_pred = model.predict(x_val, verbose=1)
8149/8149 [==============================] – 3s 363us/step

In [38]: print(y_pred.shape, y_val.shape)
(8149, 7), (8149, 7)

The confusion matrix cannot handle one-hot vectors, so let’s convert them into integer classes.

y_pred_class = np.array([np.argmax(x) for x in y_pred])
y_val_class = np.array([np.argmax(x) for x in y_val])

(Figure: Confusion matrix of true classes against predicted classes)

It classified ‘achievement’ and ‘affection’ pretty accurately, was horrible at classifying ‘nature’ and ‘exercise’ and pretty bad at everything else.

Our model was also somewhat confused between ‘achievement’ and ‘enjoy_the_moment’, which, if you think about it, would be the case even for a human sometimes.

 Right now, our model is basically an affection classifier.
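If you have never built a confusion matrix by hand, this is all there is to it. The sketch below uses three classes and made-up predictions; the helpers are mine, and in practice a library function (scikit-learn has one) does the same job.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def confusion_matrix(true_classes, pred_classes, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_classes, pred_classes):
        matrix[t][p] += 1
    return matrix

# Made-up one-hot targets and softmax-like outputs for 3 classes.
y_true_onehot = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
y_prob = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]

true_classes = [argmax(row) for row in y_true_onehot]
pred_classes = [argmax(row) for row in y_prob]
print(confusion_matrix(true_classes, pred_classes, n_classes=3))
# [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
```

Rows are true classes and columns are predictions, so anything off the diagonal is a mistake: here one sample of class 2 was predicted as class 0.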

The large discrepancy between accuracies of different classes is what stands out and it only means one thing: class imbalance.

Let’s plot a pie chart to see how bad it is.

plt.figure(figsize=(7, 7))
plt.pie(labels.sentiment.value_counts(), labels=classes);

(Figure: Pie chart of class distribution)

Turns out, it’s pretty bad!

In [51]: labels.sentiment.value_counts()
Out [51]:
affection           18817
achievement         18250
bonding              5930
enjoy_the_moment     5839
leisure              3809
nature               1017
exercise              659
Name: sentiment, dtype: int64

The smallest class, ‘exercise’, has about 3.5% the number of samples as the largest class, ‘affection’.

 Ideally you would want the exact same number of samples for all classes in your training set.

In practice, a little variance doesn’t hurt.

Sampling

To overcome this problem, there are a few things we can do, the first being sampling.

To balance our datasets, we can oversample instances of the minority class or undersample instances of the majority class.

Both come with their disadvantages however, which are more prominent in datasets with a greater imbalance, like ours.

Oversampling the minority overfits the model because of the high duplication, while undersampling might leave crucial information out.

A more powerful sampling method, SMOTE, artificially generates new instances of the minority class by interpolating between neighboring minority samples, but this still doesn’t eliminate overfitting.

  We won’t try undersampling, as it would leave our training set with about 4500 samples, which is too small even for binary classification.

 Let’s try oversampling.

We’ll not make the number of samples exactly equal, but bring it within the same ballpark.

We’ll start afresh

df = pd.read_csv('….csv')

We need to first split our training and validation sets.

Since we normally wouldn’t augment our test set, we shouldn’t augment our validation set either.

df, df_val = train_test_split(df, test_size=0.15, random_state=42)
labels = df[['id', 'sentiment']]
classes = sorted(labels.sentiment.unique())

Let’s separate the dataframes by sentiment

dfs = []
for sentiment in classes:
    df_temp = df.where(df.sentiment == sentiment)
    df_temp.dropna(axis=0, inplace=True)
    dfs.append(df_temp)

ls = [len(df) for df in dfs]
print(ls)

[15580, 15917, 5059, 4964, 555, 3234, 863]

pd.concat([df] * int(max(ls) / len(df))) generates a new dataframe with df replicated the required number of times.

We can write a one-liner to generate a list of augmented dataframes.

new_dfs = [pd.concat([df]*int(max(ls)/len(df)), ignore_index=True) for df in dfs]
new_ls = [len(df) for df in new_dfs]
print(new_ls)

[15580, 15917, 15177, 14892, 15540, 12936, 15534]

The new classes look pretty balanced.
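The arithmetic behind those numbers is easy to check without pandas. This sketch only recomputes the replication factors from the class counts in this section; nothing is resampled.

```python
# Per-class training counts from this section, in sorted class order.
ls = [15580, 15917, 5059, 4964, 555, 3234, 863]

# Each class is repeated a whole number of times so that it approaches
# (but never exceeds) the size of the largest class.
factors = [int(max(ls) / n) for n in ls]
new_ls = [n * k for n, k in zip(ls, factors)]

print(factors)      # [1, 1, 3, 3, 28, 4, 18]
print(new_ls)       # [15580, 15917, 15177, 14892, 15540, 12936, 15534]
print(sum(new_ls))  # 105576
```

Note the factor of 28 for the smallest class: every one of its responses appears 28 times in the new training set, which is why overfitting becomes the main worry.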

Let’s concatenate everything into one large dataframe

df = pd.concat(new_dfs, ignore_index=True)
labels = df[['id', 'sentiment']]
print(df.shape, len(labels))

(105576, 5) 105576

plt.figure(figsize=(7, 7))
plt.pie(labels.sentiment.value_counts(), labels=classes);

(Figure: Pie chart of class distribution after oversampling)

Looks good.

Preventing overfitting is now our only goal.

Drop the n and sentiment columns from the dataframe, and one-hot encode the labels.

Lowercase all the responses and shuffle the dataset.

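df_train, _, y_train, _ = train_test_split(df, y, test_size=0, random_state=42)
print(df_train.shape, y_train.shape)
print(df_val.shape)

(105576, 3) (105576, 7)
(8149, 5)

(Recent scikit-learn versions reject test_size=0; sklearn.utils.shuffle is an alternative if all you want is a shuffle.)

We’ll use the GoogleNews Word2Vec model to train on this set.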

All the steps are exactly the same, but this will take longer to train because of the larger dataset, so let’s validate after each epoch and save a checkpoint after validation score increases.

We just need to prepare our validation set before we start training.

The ModelCheckpoint callback expects a file path, and a metric to monitor.

save_best_only was set to True to save us some disk space.

Additionally, I have also set the learning rate to decay by a factor of 1e-6 after each epoch as our model will overfit pretty quickly.

For a summary of the training epochs, refer to the notebook version of this post.

model.fit(x_train, y_train, validation_data=[x_val, y_val], callbacks=[checkpoint], epochs=10, verbose=1)

Training accuracy reached 99.16%, but validation accuracy didn’t cross 90%.

Though this is the best result we got so far, we definitely did overfit.

Using the same dataset, we’ll now try to create a bigger model, but with more regularization, in an attempt to reduce overfitting.

Additionally, let’s use LeakyReLU activations as vanishing gradients can kill our ReLUs.

(Figure: Graph for LeakyReLU)

If you use LeakyReLU as an activation function of a layer in keras, using model.save() later will give you this error (at the time of writing this blog):

AttributeError: 'LeakyReLU' object has no attribute '__name__'

To fix this, you will have to use LeakyReLU as a layer.

We'll use LeakyReLU with alpha = 0.1 and additionally, Dropout will be used for regularization.
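In case the graph doesn't load for you, the function itself is tiny. This is a plain-Python sketch of the two activations for a single number, not the keras layer; alpha=0.1 matches the value used here.

```python
def relu(x):
    """Zero for negative inputs, identity otherwise."""
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.1):
    """Identity for positive inputs, a small slope alpha for negative ones."""
    return x if x > 0 else alpha * x

for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, relu(x), leaky_relu(x))
```

For negative inputs ReLU outputs a flat zero (so its gradient there is zero too), while LeakyReLU keeps a small slope, which is what lets a unit recover instead of staying dead.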

For a summary of the training epochs, refer to the notebook version of this post.

Our validation accuracy did not change much even though training accuracy crossed 98%.

The regularized model isn’t doing any better either; we overfit again due to the imbalance.

Let’s plot the confusion matrix for this model to see if anything changed.

But, if we run model.

predict now, we'll use the model object that was trained for the complete 10 epochs, not the one that gave us the highest validation accuracy.

To use the best one, we need to load it from our last checkpoint file.

Keras also requires us to declare any custom objects that were used; for example, load_model doesn't know what f1 means.

model = load_model('D:/Datasets/mc-sent/models/w2v_balanced_v1.hdf5', custom_objects={'f1': f1})
y_pred = model.predict(x_val, verbose=1)
y_pred_class = np.array([np.argmax(x) for x in y_pred])
y_val_class = np.array([np.argmax(x) for x in y_val])

The new confusion matrix

The confusion matrix is hardly any different, so our model overfit after all.
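If you want to see what goes into that matrix, here is a rough NumPy sketch of the argmax step followed by the counting. The helper confusion_counts and the toy arrays are made up for illustration; the notebook may well use a library function instead.

```python
import numpy as np


def confusion_counts(y_true_class, y_pred_class, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(matrix, (y_true_class, y_pred_class), 1)
    return matrix


# Toy stand-ins for y_val (one-hot rows) and the output of model.predict.
y_val_demo = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_pred_demo = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.6, 0.3, 0.1],
                        [0.1, 0.8, 0.1]])

y_val_class = np.argmax(y_val_demo, axis=1)
y_pred_class = np.argmax(y_pred_demo, axis=1)
cm = confusion_counts(y_val_class, y_pred_class, n_classes=3)
print(cm)
```

Passing axis=1 to np.argmax does the same job as the list comprehension, just in one call.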

The imbalance in this dataset is proving to be too difficult to combat.

 But there’s another, perhaps less stupid way of dealing with imbalance that we haven’t tried yet.

Cost-sensitive learning

In this method, we penalize misclassifications differently.

Misclassifications of the minority class are penalized more heavily than those of the majority class, which means the loss is different for each class.

Such a penalty system may induce the model to pay more attention to the minority class.

Concretely, we calculate a class weight dictionary and feed it to the .fit method during training, and Keras modifies the loss function accordingly.

Scikit-learn has a handy function to calculate class weights.
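To make the idea concrete, here is a small NumPy sketch of the 'balanced' heuristic, n_samples / (n_classes * class_count), and of how a per-class weight scales each sample's cross-entropy. Both helper names and the toy labels are my own. Keras's exact averaging of the weighted losses may differ from this simplified version.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


def weighted_losses(y_true_class, y_prob, class_weight):
    """Per-sample cross-entropy, scaled by the weight of the true class."""
    y_true_class = np.asarray(y_true_class)
    y_prob = np.asarray(y_prob, dtype=float)
    true_prob = y_prob[np.arange(len(y_true_class)), y_true_class]
    sample_weights = np.array([class_weight[int(c)] for c in y_true_class])
    return -np.log(true_prob) * sample_weights


# Toy labels: class 0 is three times as common as class 1.
y_demo = np.array([0, 0, 0, 1])
weights_demo = balanced_class_weights(y_demo)

# Both samples are equally wrong, but the minority one costs three times more.
losses_demo = weighted_losses([0, 1], [[0.5, 0.5], [0.5, 0.5]], weights_demo)
print(weights_demo, losses_demo)
```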

Starting afresh

df = pd.read_csv('….csv')

Now is a good time to split our training and validation sets.

df, df_val = train_test_split(df, test_size=0.15, random_state=42)
labels = df[['id', 'sentiment']]
classes = sorted(labels.sentiment.unique())

Let’s find our class weights

class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(labels.sentiment), y=labels.sentiment)

[0.42336329, 0.4143997, 1.30381498, 1.32876712, 11.88468468, 2.03957947, 7.64310545]

We need to convert this into an enumerated dictionary for Keras to be able to parse it.

In [9]: class_weight_dict = dict(enumerate(class_weights))
print(class_weight_dict)

Out [9]: {0: 0.4233632862644416, 1: 0.41439969843563484, 2: 1.3038149831982606, 3: 1.3287671232876712, 4: 11.884684684684684, 5: 2.0395794681508965, 6: 7.643105446118192}

We can pass this dictionary to Keras to change its loss function accordingly.


print(df.shape, labels.shape)
(46172, 5) (46172, 2)

print(df_val.shape)
(8149, 5)

Follow the same exact steps as before, using the balanced_relu model, and set trainable = True in the Embedding layer, for a change.

Prepare the validation set and compile the model just like we already did.

Before you call fit however, set the class_weight parameter like so.


model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[checkpoint], class_weight=class_weight_dict, epochs=15, verbose=1)

We’ve reached 91% validation accuracy! It’s the first time we surpassed 90%.

There is one last thing I want us to try.

ELMo Embeddings

ELMo was released by AllenNLP last year, and we'll use its embeddings at the sentence level.

As per the inventors, ELMo is a deep contextualized word representation that models both complex characteristics of word use, and how these uses vary across linguistic contexts.

The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.

These embeddings are available through the TensorFlow Hub API.

Since we use these embeddings at the sentence level, we don’t need to tokenize the responses.

Do no preprocessing on the dataset apart from dropping the n and sentiment columns.

In [18]: x_train = np.array([np.array(sentence) for sentence in df_train.response])

In [19]: x_train[:5]

Out [19]: array(['i fall in love with my girl friend', 'My friend came over to watch Critical Role.', 'I thought that I was out of cream for coffee, but lo and behold, there remained a can in the back of the pantry!', 'I discovered a new software, Thunkable, that will help me develop an Android app faster than I originally thought.', 'I found an alternate source of income.'], dtype='<U6707')

y_train might look something like this.

In [20]: y_train[:5]

Out [20]: array([[0, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]], dtype=int8)

We’ll have to write our own class inheriting the Layer class from Keras and define a few mandatory functions.

Process the validation set just like the training set.

We’ll be using the same architecture as before, but let’s drop the fancy activation function this time.


model.fit(x_train, y_train, batch_size=8, validation_data=(x_val, y_val), callbacks=[checkpoint], class_weight=class_weight_dict, epochs=5, verbose=1)

I had to use a tiny batch_size to prevent running out of memory.

This takes hours to train and reached a validation score of 0.9244 after just 5 epochs! (I didn’t have enough time to train for more epochs; maybe next time.)

Using all the features

We have been using the response feature exclusively.

What if the period column helped improve the validation accuracy? You can try this out yourself.

Convert the period column into two columns is_24h and is_3m, as period can only be one of these two values.

In other words, one-hot-encode it.
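Here is a tiny sketch of that encoding with NumPy. I'm assuming the raw values in the period column are the strings '24h' and '3m'; check your copy of the dataset before relying on that.

```python
import numpy as np

# Toy stand-in for the period column; the string values are assumed.
period = np.array(['24h', '3m', '3m', '24h'])

# One flag per possible value of period.
is_24h = (period == '24h').astype(int)
is_3m = (period == '3m').astype(int)

# Exactly one of the two flags is set on every row.
assert np.all(is_24h + is_3m == 1)
print(is_24h, is_3m)
```

The two resulting arrays can then be stored as new columns on the DataFrame.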

Your architecture will change accordingly.

Two new Input layers will be added, and model.fit will require three inputs instead of one.

Here’s what it would look like on one of the networks.

The call to model.fit would also change to accommodate the three inputs.

model.fit([x_train, df_train.is_24h, df_train.is_3m], y_train, validation_data=([x_val, df_val.is_24h, df_val.is_3m], y_val), callbacks=[checkpoint], epochs=10, verbose=1)

Time to go retrain everything!

Ensemble

Finally, after a long day of work, it was time to wind up.

I fired up a new terminal, read the test set ( p_test ) into memory, loaded ten of the best trained models from their checkpoint files, generated the prediction arrays, and combined them in two ways, basically creating a non-trainable ensemble.

Hard voting: Let’s say we have ten prediction vectors from ten different models for a single row in the test set.

We take the argmax of each vector, and predict the mode of those ten values.

This method is usually preferred as the prediction vectors do not interact with each other and correlation is minimum.

Majority wins, and the classifiers that differ are silenced.

Soft voting: In this, the ten prediction vectors are added up together element-wise and the argmax of the resulting vector is returned as the prediction.

This takes interactions into consideration and hence, total accuracy is not a lot greater than individual accuracies, but the ensemble is less likely to predict false positives.

It’s a jury, where every opinion matters.
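The two voting schemes described above can be sketched in a few lines of NumPy. The function names and the toy predictions (three models instead of ten) are my own. Ties in hard voting go to the lowest class index here, which is just one possible convention.

```python
import numpy as np


def hard_vote(predictions):
    """predictions has shape (n_models, n_rows, n_classes)."""
    votes = np.argmax(predictions, axis=2)  # each model's pick per row
    n_classes = predictions.shape[2]
    # counts[c, r] = number of models that voted for class c on row r
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return np.argmax(counts, axis=0)  # the mode of the votes


def soft_vote(predictions):
    """Add the prediction vectors element-wise, then take the argmax."""
    return np.argmax(predictions.sum(axis=0), axis=1)


# Three toy models, two rows, three classes.
preds = np.array([
    [[0.40, 0.60, 0.00], [0.1, 0.2, 0.7]],
    [[0.45, 0.55, 0.00], [0.1, 0.2, 0.7]],
    [[0.90, 0.10, 0.00], [0.1, 0.2, 0.7]],
])

print(hard_vote(preds))  # two of the three models pick class 1 on row 0
print(soft_vote(preds))  # the one confident model pulls row 0 to class 0
```

Row 0 shows how the two schemes can disagree: the majority wins under hard voting, while one very confident model can swing the result under soft voting.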

Soft voting gave me a final accuracy of 94.11% and hard voting got me to 94.…%.


The notebook for this post and related files can be found here.

My follow-up post will cover capsule networks, attention networks, fastText and BERT for this same problem statement, once I finish studying them.

If you’ve reached here, I have done a good job.

Hopefully you learnt something new and so did I.

Thank you for your time!