NLP Learning Series: Part 3 — Attention, CNN and what not for Text Classification

Let us say we have a sentence and we have maxlen = 70 and embedding size = 300.

We can create a matrix of numbers with the shape 70×300 to represent this sentence.

For images, we also have a matrix where individual elements are pixel values.

Instead of image pixels, the input to the tasks is sentences or documents represented as a matrix.

Each row of the matrix corresponds to one word vector.
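To make the 70×300 shape concrete, here is a minimal sketch in NumPy of how a sentence could be turned into such a matrix. The helper name `sentence_to_matrix`, the random toy embeddings, and the choice to zero-pad at the end (and to leave unknown words as zero rows) are all assumptions made for illustration; they are not taken from the kernels linked in this post.

```python
import numpy as np

MAXLEN = 70       # maximum number of words per sentence (from the text)
EMBED_SIZE = 300  # embedding dimension (from the text)

def sentence_to_matrix(tokens, embeddings, maxlen=MAXLEN, embed_size=EMBED_SIZE):
    """Stack one embedding row per word, truncating or zero-padding to maxlen rows."""
    matrix = np.zeros((maxlen, embed_size), dtype=np.float32)
    for row, token in enumerate(tokens[:maxlen]):
        vector = embeddings.get(token)
        if vector is not None:  # assumption: unknown words stay as all-zero rows
            matrix[row] = vector
    return matrix

# Hypothetical toy embeddings: random vectors stand in for pretrained ones.
rng = np.random.default_rng(0)
toy_embeddings = {
    word: rng.normal(size=EMBED_SIZE)
    for word in ["you", "will", "never", "believe"]
}

sentence_matrix = sentence_to_matrix("you will never believe".split(), toy_embeddings)
print(sentence_matrix.shape)  # (70, 300)
```

Only the first four rows carry word vectors here; the remaining 66 rows are padding.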

Convolution Idea: While for an image we move our conv filter horizontally as well as vertically, for text we fix the kernel size to filter_size x embed_size, i.e. (3,300). We then only move vertically down for the convolution, looking at three words at once, since our filter size is 3 in this case.

This idea seems right since our convolution filter is not splitting word embedding.

It gets to look at the full embedding of each word.

Also, one can think of filter sizes as unigrams, bigrams, trigrams, etc., since we are looking at a context window of 1, 2, 3, and 5 words respectively.
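The sliding-window idea above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `text_conv` is made up, the kernels are random instead of learned, and the ReLU plus max-over-time pooling step is the common TextCNN recipe rather than something confirmed by the text above.

```python
import numpy as np

def text_conv(sentence_matrix, kernel):
    """Slide a (filter_size, embed_size) kernel down the rows of the sentence matrix."""
    maxlen, embed_size = sentence_matrix.shape
    filter_size, kernel_width = kernel.shape
    assert kernel_width == embed_size, "kernel must span the full embedding"
    n_positions = maxlen - filter_size + 1
    feature_map = np.empty(n_positions)
    for start in range(n_positions):
        window = sentence_matrix[start:start + filter_size]  # filter_size consecutive words
        feature_map[start] = np.sum(window * kernel)
    return feature_map

rng = np.random.default_rng(0)
sentence_matrix = rng.normal(size=(70, 300))  # stand-in for a real embedded sentence

pooled = []
for filter_size in (1, 2, 3, 5):
    kernel = rng.normal(size=(filter_size, 300))  # random here, learned in a real model
    feature_map = np.maximum(text_conv(sentence_matrix, kernel), 0.0)  # ReLU (assumption)
    pooled.append(feature_map.max())  # max-over-time pooling (assumption)

features = np.array(pooled)
print(features.shape)  # (4,)
```

With a filter size of 3 and 70 rows, the kernel fits in 68 positions, so each filter produces a feature map of length 68 before pooling. A real network would use many filters per size, not one.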

Here is the text classification network coded in Pytorch:

And for the Keras enthusiasts:

I am a big fan of Kaggle Kernels.

One could not have imagined having all that compute for free.

You can find a running version of the above two code snippets in this kaggle kernel.

Do try to experiment with it after forking and running the code.

Also please upvote the kernel if you find it helpful.

The Keras model and the Pytorch model performed similarly, with the Pytorch model beating the Keras model by a small margin.

The Out-Of-Fold CV F1 score for the Pytorch model came out to be 0.6609, while for the Keras model the same score came out to be 0.


I used the same preprocessing in both the models to be better able to compare the platforms.


BiDirectional RNN (LSTM/GRU):

TextCNN works well for Text Classification.

It takes care of words in close range.

It can see “new york” together.

However, it still can’t take care of all the context provided in a particular text sequence.

It still does not learn the sequential structure of the data, where every word is dependent on the previous word or on a word in the previous sentence.

RNNs help us with that.

They can remember previous information using hidden states and connect it to the current task.

Long Short Term Memory networks (LSTM) are a subclass of RNN, specialized in remembering information for an extended period.

Moreover, the Bidirectional LSTM keeps the contextual information in both directions, which is pretty useful in a text classification task (but it won’t work for a time series prediction task, as we don’t have visibility into the future in that case).

For the most simplistic explanation of a Bidirectional RNN, think of the RNN cell as a black box taking as input a hidden state (a vector) and a word vector, and giving out an output vector and the next hidden state.

This box has some weights which are to be tuned using Backpropagation of the losses.

Also, the same cell is applied to all the words so that the weights are shared across the words in the sentence.

This phenomenon is called weight-sharing.

Hidden state, Word vector -> (RNN Cell) -> Output Vector, Next Hidden state

For a sequence of length 4 like “you will never believe”, the RNN cell gives 4 output vectors, which can be concatenated and then used as part of a dense feedforward architecture.

In the Bidirectional RNN, the only change is that we read the text in the usual fashion as well as in reverse.

So we stack two RNNs in parallel, and hence we get 8 output vectors to append.

Once we get the output vectors, we send them through a series of dense layers and finally a softmax layer to build a text classifier.
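The black-box view of the RNN cell and the bidirectional pass described above can be sketched as follows. This is a plain NumPy illustration, not the PyTorch or Keras model from the kernel: the names `rnn_cell`, `run_rnn` and `init_params`, the tanh update, the hidden size of 64, and the random weights are all assumptions.

```python
import numpy as np

def rnn_cell(hidden, word_vector, W_h, W_x, b):
    """Black box: (hidden state, word vector) -> (output vector, next hidden state)."""
    next_hidden = np.tanh(W_h @ hidden + W_x @ word_vector + b)
    return next_hidden, next_hidden  # in a vanilla RNN the output is the new hidden state

def run_rnn(word_vectors, params, hidden_size):
    """Apply the same cell (same weights) to every word and collect the outputs."""
    hidden = np.zeros(hidden_size)
    outputs = []
    for word_vector in word_vectors:  # weight sharing: params never change inside the loop
        output, hidden = rnn_cell(hidden, word_vector, *params)
        outputs.append(output)
    return outputs

def init_params(rng, embed_size, hidden_size):
    """Random weights standing in for values learned by backpropagation."""
    return (
        rng.normal(scale=0.1, size=(hidden_size, hidden_size)),  # W_h
        rng.normal(scale=0.1, size=(hidden_size, embed_size)),   # W_x
        np.zeros(hidden_size),                                    # b
    )

rng = np.random.default_rng(0)
embed_size, hidden_size = 300, 64
words = [rng.normal(size=embed_size) for _ in "you will never believe".split()]

# One RNN reads left to right, a second one (with its own weights) reads right to left.
forward = run_rnn(words, init_params(rng, embed_size, hidden_size), hidden_size)
backward = run_rnn(words[::-1], init_params(rng, embed_size, hidden_size), hidden_size)[::-1]

all_outputs = forward + backward
features = np.concatenate(all_outputs)
print(len(all_outputs), features.shape)  # 8 (512,)
```

The 8 output vectors of size 64 concatenate into a single 512-dimensional feature vector, which is what the dense layers would receive.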

In most cases, you need to understand how to stack some layers in a neural network to get the best results.

We can try out multiple bidirectional GRU/LSTM layers in the network if it performs better.

Due to the limitations of RNNs like not remembering long term dependencies, in practice, we almost always use LSTM/GRU to model long term dependencies.

In such a case you can think of the RNN cell being replaced by an LSTM cell or a GRU cell in the above figure.

An example model is provided below.

You can use CuDNNGRU interchangeably with CuDNNLSTM when you build models.

(CuDNNGRU/LSTM are just implementations of LSTM/GRU that are created to run faster on GPUs. In most cases, use them instead of the vanilla LSTM/GRU implementations.)

So here is some code in Pytorch for this network.

Also, here is the same code in Keras.

You can run this code in my BiLSTM with Pytorch and Keras kaggle kernel for this competition.

Please do upvote the kernel if you find it helpful.

In the BiLSTM case also, the Pytorch model beats the Keras model by a small margin.

The Out-Of-Fold CV F1 score for the Pytorch model came out to be 0.6741, while for the Keras model the same score came out to be 0.


This score is around a 1–2% increase from the TextCNN performance which is pretty good.

Also, note that it is around 6–7% better than conventional methods.


Attention Models

Dzmitry Bahdanau et al. first presented attention in their paper Neural Machine Translation by Jointly Learning to Align and Translate, but I find that the paper on Hierarchical Attention Networks for Document Classification, written jointly by CMU and Microsoft in 2016, is a much easier read and provides more intuition.

So let us talk about the intuition first.

In the past, with conventional methods like TFIDF/CountVectorizer etc., we used to find features from the text by doing keyword extraction.

Some words are more helpful in determining the category of a text than others.

However, in this method we sort of lost the sequential structure of the text.

With LSTM and deep learning methods, while we can take care of the sequence structure, we lose the ability to give higher weight to more important words.

Can we have the best of both worlds? The answer is Yes.

Actually, Attention is all you need.

In the authors’ words:

“Not all words contribute equally to the representation of the sentence meaning. Hence, we introduce attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.”

In essence, we want to create scores for every word in the text, which is the attention similarity score for a word.

To do this, we start with a weight matrix(W), a bias vector(b) and a context vector u.

The optimization algorithm learns all of these weights.

On this note I would like to highlight something I like a lot about neural networks — If you don’t know some params, let the network learn them.

We only have to worry about creating architectures and params to tune.

Then there are a series of mathematical operations.

See the figure for more clarification.

We can think of u1 as a nonlinearity applied to the RNN output for a word.

After that, v1 is the exponential of the dot product of u1 with the context vector u.

From an intuition viewpoint, the value of v1 will be high if u and u1 are similar.

Since we want the sum of scores to be 1, we divide v by the sum of v’s to get the final scores, s.

These final scores are then multiplied by the RNN outputs for the words to weight them according to their importance, after which the outputs are summed and sent through dense layers and softmax for the task of text classification.
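The scoring steps above can be written out as a short NumPy sketch. It assumes tanh as the nonlinearity (as in the Hierarchical Attention Networks paper), uses random values in place of the learned W, b and u, and picks arbitrary sizes; the function name `attention` is made up for this example and is not the layer from the kernel.

```python
import numpy as np

def attention(rnn_outputs, W, b, u):
    """rnn_outputs has shape (n_words, hidden). Returns (scores s, weighted sum)."""
    u_words = np.tanh(rnn_outputs @ W + b)  # u1, u2, ...: nonlinearity on each RNN output
    logits = u_words @ u                    # dot product of each u_i with the context vector u
    v = np.exp(logits - logits.max())       # exponentiate; the shift is for numerical
                                            # stability only and does not change s
    s = v / v.sum()                         # final scores, summing to 1
    sentence_vector = (s[:, None] * rnn_outputs).sum(axis=0)  # weight the outputs, then sum
    return s, sentence_vector

rng = np.random.default_rng(0)
n_words, hidden, attn_size = 4, 128, 64        # illustrative sizes (assumptions)
rnn_outputs = rng.normal(size=(n_words, hidden))

# Random stand-ins for parameters that the optimizer would learn.
W = rng.normal(scale=0.1, size=(hidden, attn_size))
b = np.zeros(attn_size)
u = rng.normal(size=attn_size)

s, sentence_vector = attention(rnn_outputs, W, b, u)
print(s.sum(), sentence_vector.shape)
```

Dividing v by its sum is exactly a softmax over the words, so a word whose u_i points in the same direction as u ends up with a larger share of the final sentence vector.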

Here is the code in Pytorch.

Do try to read through the pytorch code for attention layer.

It just does what I have explained above.

Same code for Keras.

Again, my Attention with Pytorch and Keras Kaggle kernel contains the working versions for this code.

Please do upvote the kernel if you find it useful.

This method performed well, with Pytorch CV scores reaching around 0.6758 and Keras CV scores reaching around 0.


This score is more than what we were able to achieve with BiLSTM and TextCNN.

However, please note that we didn’t work on tuning any of the given methods yet and so the scores might be different.

With this, I leave you to experiment with new architectures and play around with stacking multiple GRU/LSTM layers to improve your network performance.

You can also look at including more techniques in these networks, like Bucketing, handmade features, etc.

Some of the tips and new techniques are mentioned here on my blog post: What my first Silver Medal taught me about Text Classification and Kaggle in general?.

Also, here is another Kaggle kernel which is my silver-winning entry for this competition.

Results

Here are the final results of all the different approaches I have tried on the Kaggle Dataset.

I ran a 5 fold Stratified CV.


a. Conventional Methods:

b. Deep Learning Methods:

PS: Note that I didn’t work on tuning the above models, so these results are only cursory.

You can try to squeeze more performance by performing hyperparams tuning using hyperopt or just old fashioned Grid-search.
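For the old fashioned Grid-search route, a bare-bones sketch in plain Python could look like the one below. The scoring function `toy_cv_score` is a hypothetical stand-in for "train the model with 5 fold CV and return the F1 score", and the parameter names and values in the grid are made up for illustration.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for a real cross-validated F1 score.
def toy_cv_score(hidden_size, dropout):
    return -abs(hidden_size - 64) / 100 - abs(dropout - 0.2)

param_grid = {"hidden_size": [32, 64, 128], "dropout": [0.1, 0.2, 0.5]}
best_params, best_score = grid_search(toy_cv_score, param_grid)
print(best_params)
```

Each extra hyperparameter multiplies the number of training runs, which is why a smarter search such as hyperopt becomes attractive once the grid grows.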

Conclusion

In this post, I went through explanations of the various deep learning architectures people are using for text classification tasks.

In the next post, we will delve further into the next new phenomenon in NLP space — Transfer Learning with BERT and ULMFit along with their intuition.

Follow me on Medium or subscribe to my blog to be informed about my next post.

Also if you want to learn more about NLP here is an excellent course.

You can start for free with the 7-day Free Trial.

Let me know if you think I can add something more to the post; I will try to incorporate it.

Cheers!!!

Originally published at mlwhiz.com on March 9, 2019.

