Build Your First Text Classification model using PyTorch

Overview

- Learn how to perform text classification using PyTorch
- Understand the key points involved while solving text classification
- Learn to use the pack padding feature

Introduction

I always turn to state-of-the-art architectures for my first submission in a hackathon.

Implementing state-of-the-art architectures has become quite easy, thanks to deep learning frameworks.

Frameworks make it possible to implement these architectures with minimal knowledge of the underlying concepts and limited coding skills.

In short, it’s a goldmine for a data scientist like me! There are a number of deep learning frameworks that we can use to accomplish our tasks, each with its own pros and cons.

One such popular deep learning framework is PyTorch, which is well known for its computational speed.

So in this article, we will walk through the key points for solving the text classification problem.

And then we will implement our first text classifier in PyTorch!

Note: I highly recommend going through the article below before moving forward with this one:

- A Beginner-Friendly Guide to PyTorch and How it Works from Scratch

Table of Contents

1. Why PyTorch for Text Classification?
   - Dealing with Out of Vocabulary words
   - Handling variable length sequences
   - Wrappers and Pre-trained models
2. Understanding the Problem Statement
3. Implementation – Text Classification

Why PyTorch for Text Classification?

Before we dive deeper into the technical concepts, let us quickly familiarize ourselves with the framework that we are going to use: PyTorch.

The basic unit of PyTorch is the Tensor, which is similar to a NumPy array in Python.
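As a quick illustration of the tensor as the basic unit, here is a minimal sketch. The values are arbitrary and only meant to show that tensors behave much like NumPy arrays.

```python
import torch

# A 2x3 tensor built from a nested Python list, much like numpy.array(...)
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

print(x.shape)        # the dimensions of the tensor
print(x.sum(dim=0))   # column-wise sum
doubled = x * 2       # element-wise arithmetic, as with NumPy arrays
print(doubled)
```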

There are a number of benefits to using PyTorch, but the two most important are:

- Dynamic networks: the architecture can change at run time
- Distributed training across GPUs

I am sure you are wondering: why should we use PyTorch for working with text data? Let us discuss some incredible features of PyTorch that set it apart from other frameworks, especially when working with text data.


1. Dealing with Out of Vocabulary words

A text classification model is trained on a fixed vocabulary size.

But during inference, we might come across some words which are not present in the vocabulary.

These words are known as Out of Vocabulary words, and many deep learning frameworks do not handle them out of the box.

This is a critical issue and can result in a loss of information.

To handle Out of Vocabulary words, PyTorch (through the TorchText library used later in this article) supports a handy feature that replaces the rare words in our training data with an unknown token.

This in turn helps us tackle the problem of Out of Vocabulary words.

Apart from handling Out of Vocabulary words, PyTorch also has a feature that can handle sequences of variable length!

2. Handling variable length sequences

Have you heard how a Recurrent Neural Network is capable of handling variable-length sequences? Ever wondered how to implement it? PyTorch comes with a useful feature, the 'Packed Padding sequence', that implements a dynamic Recurrent Neural Network.

Padding is the process of adding an extra token, called the padding token, at the beginning or end of a sentence.

As the number of words in each sentence varies, we convert the variable length input sentences into sentences of the same length by adding padding tokens.

Padding is required since most frameworks support only static networks, i.e. the architecture remains the same throughout model training.
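To make padding concrete, here is a minimal sketch using `pad_sequence` from `torch.nn.utils.rnn`. The word indices are made up, and reserving index 0 for the padding token is an assumption of this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three "sentences" already converted to word indices (values are made up).
# Index 0 is reserved here for the padding token.
sentences = [
    torch.tensor([4, 7, 2, 9]),
    torch.tensor([5, 3]),
    torch.tensor([8, 6, 1]),
]

# Pad at the end of each sentence so every row is as long as the longest one
padded = pad_sequence(sentences, batch_first=True, padding_value=0)

# Keep the true lengths around; they are needed to ignore the padding later
lengths = torch.tensor([len(s) for s in sentences])

print(padded)
print(lengths)
```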

Padding solves the issue of variable length sequences, but it introduces another problem: the architecture now processes the padding tokens like any other data.

Let me explain this through a simple diagram. As you can see in the diagram (below), the last element, which is a padding token, is also used while generating the output.

This is taken care of by the Packed Padding sequence in PyTorch.

Packed padding ignores the input timesteps that contain a padding token.

These values are never shown to the Recurrent Neural Network, which helps us build a dynamic Recurrent Neural Network.
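The idea can be sketched with `pack_padded_sequence` and `pad_packed_sequence` from `torch.nn.utils.rnn`. The tensor sizes below are arbitrary, and random numbers stand in for real word embeddings.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Batch of 2 padded sequences, max length 3, 4 features per timestep.
# The second sequence has only 1 real timestep; the rest is padding (zeros).
batch = torch.randn(2, 3, 4)
batch[1, 1:] = 0.0
lengths = torch.tensor([3, 1])  # true lengths, sorted in decreasing order

lstm = nn.LSTM(input_size=4, hidden_size=5, batch_first=True)

# Pack: the LSTM only sees the real timesteps of each sequence
packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, (hidden, cell) = lstm(packed)

# Unpack to a padded tensor; positions past each true length are zeros
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(output.shape)  # (batch, max_len, hidden_size)
print(hidden.shape)  # (num_layers, batch, hidden_size)
```

For the second sequence, the final hidden state equals the output at its single real timestep, so the padded positions never influence the result.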


3. Wrappers and Pretrained models

State-of-the-art architectures are regularly released for the PyTorch framework.

Hugging Face released Transformers, which provides more than 32 state-of-the-art architectures for Natural Language Understanding and Generation! Not only this, PyTorch also provides pretrained models for several tasks like Text to Speech, Object Detection and so on, which can be executed within a few lines of code.

Incredible, isn’t it? These are some really useful features of PyTorch among many others.

Let us now use PyTorch for a text classification problem.

Understanding the Problem Statement

As a part of this article, we are going to work on a really interesting problem.

Quora wants to keep track of insincere questions on their platform so as to make users feel safe while sharing their knowledge.

An insincere question in this context is defined as a question intended to make a statement rather than looking for helpful answers.

To break this down further, here are some characteristics that can signify that a particular question is insincere:

- Has a non-neutral tone
- Is disparaging or inflammatory
- Isn’t grounded in reality
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked, and a flag denoting whether it was identified as insincere (target = 1).

The ground-truth labels contain some amount of noise, i.e. they are not guaranteed to be perfect.

Our task will be to identify if a given question is ‘insincere’.

You can download the dataset for this from here.

It is time to code our own text classification model using PyTorch.

Implementation – Text Classification

Let us first import all the necessary libraries required to build a model.

Here is a brief overview of the packages/libraries we are going to use:

- Torch: used to define tensors and the mathematical operations on them
- TorchText: a Natural Language Processing (NLP) library in PyTorch. This library contains scripts for preprocessing text and provides a few popular NLP datasets.

  View the code on Gist.

In order to make the results reproducible, I have specified the seed value.

Since a deep learning model might produce different results each time it is executed due to the randomness in it, it is important to specify the seed value.

View the code on Gist.
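A minimal sketch of fixing the seed is shown here. The seed value itself is an arbitrary choice, and the cuDNN flag only matters when running on a GPU.

```python
import random
import torch

SEED = 2019  # any fixed integer works; this value is an arbitrary choice

random.seed(SEED)
torch.manual_seed(SEED)
# Ask cuDNN for deterministic kernels when running on a GPU
torch.backends.cudnn.deterministic = True

a = torch.rand(3)
torch.manual_seed(SEED)  # re-seeding reproduces the same random numbers
b = torch.rand(3)
print(torch.equal(a, b))
```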

Preprocessing data:

Now, let us see how to preprocess the text using field objects.

There are 2 different types of field objects: Field and LabelField.

Let us quickly understand the difference between the two:

- Field: the Field object from the data module is used to specify the preprocessing steps for each column in the dataset.
- LabelField: the LabelField object is a special case of Field that is used only for classification tasks. Its only purpose is to set sequential to False and unk_token to None by default.

Before we use Field, let us look at its different parameters and what they are used for.

Parameters of Field:

- tokenize: specifies the way of tokenizing the sentence, i.e. converting a sentence to words. I am using the spaCy tokenizer since it uses a novel tokenization algorithm
- lower: converts text to lowercase
- batch_first: the first dimension of the input and output is always the batch size

Next, we are going to create a list of tuples where the first value in every tuple is a column name and the second value is a field object defined above.

Furthermore, we will arrange the tuples in the order of the columns of the csv file, and specify (None, None) to ignore a column.

Let us read only the required columns, question and label:

fields = [(None, None), ('text', TEXT), ('label', LABEL)]

In the following code block I have loaded the custom dataset by defining the field objects.

View the code on Gist.

Let us now split the dataset into training and validation data:

View the code on Gist.
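As a stand-in for the split step, here is a minimal sketch using `random_split` from `torch.utils.data` on a toy list of examples. The 70/30 ratio, the seed and the example data are assumptions for illustration; a TorchText dataset would be split through its own API instead.

```python
import torch
from torch.utils.data import random_split

# Stand-in dataset: ten (question, label) pairs with made-up content
examples = [(f"question {i}", i % 2) for i in range(10)]

# 70% training, 30% validation; the generator seed keeps the split reproducible
generator = torch.Generator().manual_seed(2019)
train_data, valid_data = random_split(examples, [7, 3], generator=generator)

print(len(train_data), len(valid_data))
print(train_data[0])  # one (question, label) pair from the training split
```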

Preparing input and output sequences:

The next step is to build the vocabulary for the text and convert the text into integer sequences.

The vocabulary contains the unique words in the entire text.

Each unique word is assigned an index.

The relevant parameter is listed below.

Parameters:

- min_freq: ignores the words in the vocabulary whose frequency is less than the specified value and maps them to the unknown token.

Two special tokens, known as unknown and padding, will be added to the vocabulary:

- The unknown token is used to handle Out of Vocabulary words
- The padding token is used to make input sequences the same length

Let us build the vocabulary and initialize the words with the pretrained embeddings:

View the code on Gist.
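The effect of min_freq and the two special tokens can be sketched in plain Python. This is not TorchText's implementation: the helper names `build_vocab` and `numericalize` are hypothetical, and pretrained embeddings are left out.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(tokenized_sentences, min_freq=1):
    """Map each word seen at least `min_freq` times to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    stoi = {UNK: 0, PAD: 1}  # special tokens come first
    for word, freq in counts.most_common():
        if freq >= min_freq:
            stoi[word] = len(stoi)
    return stoi

def numericalize(tokens, stoi):
    """Convert words to indices; unseen or rare words fall back to <unk>."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

train = [["how", "do", "i", "learn", "pytorch"],
         ["how", "do", "i", "learn", "nlp"]]
stoi = build_vocab(train, min_freq=2)

# "pytorch" and "nlp" occur only once, so they are not in the vocabulary
print(stoi)
# "rust" was never seen, so it maps to the <unk> index 0
print(numericalize(["how", "do", "i", "learn", "rust"], stoi))
```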

Now we will prepare batches for training the model.

BucketIterator forms the batches in such a way that a minimum amount of padding is required.

View the code on Gist.
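Why grouping sentences of similar length reduces padding can be seen in a small plain-Python sketch. It only illustrates the idea behind bucketing, not BucketIterator itself; the helper names and the sentence lengths are made up.

```python
def make_batches(lengths, batch_size):
    """Split a list of sentence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def padding_needed(batches):
    """Total padding tokens if each batch is padded to its longest sentence."""
    return sum(max(batch) - n for batch in batches for n in batch)

# Sentence lengths in their original order (made-up values)
lengths = [3, 12, 4, 11, 2, 13]

naive = make_batches(lengths, batch_size=3)             # batches in original order
bucketed = make_batches(sorted(lengths), batch_size=3)  # similar lengths together

print(padding_needed(naive))     # 30
print(padding_needed(bucketed))  # 6
```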

Model Architecture:

It is now time to define the architecture to solve the binary classification problem.

The nn.Module class from torch is the base class for all models.

This means that every model must be a subclass of nn.Module.

I have defined 2 functions here: __init__ as well as forward.

Let me explain the use case of both of these functions:

1. Init: whenever an instance of a class is created, the __init__ function is automatically invoked. Hence, it is called the constructor. The arguments passed to the class are initialized by the constructor. Here we define all the layers that we will be using in the model.
2. Forward: the forward function defines the forward pass of the inputs.
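The two functions can be seen in a minimal model. The class name `TinyModel` and the layer sizes are hypothetical and chosen only to keep the sketch short.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, in_features, out_features):
        super().__init__()  # required so PyTorch can register the layers
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        # Defines how an input batch flows through the layers
        return torch.sigmoid(self.fc(x))

model = TinyModel(in_features=4, out_features=1)
out = model(torch.randn(8, 4))  # calling the model runs forward()
print(out.shape)
```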

Finally, let’s look in detail at the different layers used for building the architecture and their parameters:

1. Embedding layer:

Embeddings are extremely important for any NLP related task since they represent a word in a numerical format.

Embedding layer creates a look up table where each row represents an embedding of a word.

The embedding layer converts the integer sequence into a dense vector representation.
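To make the lookup table concrete, here is a minimal sketch with `nn.Embedding`. The vocabulary size of 10 and the vector size of 4 are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

# Lookup table with 10 rows (one per word) and 4 columns (vector size)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# One sentence of three word indices, shaped (batch, seq_len)
indices = torch.tensor([[2, 5, 9]])
vectors = embedding(indices)

print(vectors.shape)  # (batch, seq_len, embedding_dim)
# Each output vector is simply the matching row of the table
print(torch.equal(vectors[0, 0], embedding.weight[2]))
```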

Here are the two most important parameters of the embedding layer.

Parameters:

- num_embeddings: the number of unique words in the dictionary
- embedding_dim: the number of dimensions in the vector representation of a word

2. LSTM:

LSTM is a variant of RNN that is capable of capturing long term dependencies.

Following are some important parameters of LSTM that you should be familiar with.

Parameters:

- input_size: the number of input dimensions
- hidden_size: the number of hidden nodes
- num_layers: the number of layers to be stacked
- batch_first: if True, the input and output tensors are provided as (batch, seq, feature)
- dropout: if non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
- bidirectional: if True, introduces a bidirectional LSTM

3. Linear layer:

Linear layer refers to a dense layer. The two important parameters here are described below.

Parameters:

- in_features: the number of input features
- out_features: the number of hidden nodes in a hidden layer

4. Pack padding:

As already discussed, pack padding is used to define the dynamic recurrent neural network.

Without pack padding, the padding inputs are also processed by the RNN, which then returns the hidden state of the padded element.

Pack padding is an awesome wrapper that does not show the padded inputs to the RNN.

It simply ignores those values and returns the hidden state of the last non-padded element.

Now that we have a good understanding of all the blocks of the architecture, let us go to the code! I will start with defining all the layers of the architecture:

View the code on Gist.
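A sketch of how these layers could fit together is given here. The class name, argument names and sizes are assumptions rather than the original Gist, and `enforce_sorted=False` is used so the batch does not have to be sorted by length.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class Classifier(nn.Module):  # class and argument names are assumptions
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            bidirectional=bidirectional, dropout=dropout,
                            batch_first=True)
        # A bidirectional LSTM yields two final hidden states per example
        num_directions = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_dim * num_directions, output_dim)

    def forward(self, text, text_lengths):
        embedded = self.embedding(text)  # (batch, seq, emb)
        # Pack so that the LSTM skips the padded timesteps
        packed = pack_padded_sequence(embedded, text_lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (hidden, _) = self.lstm(packed)  # (layers * dirs, batch, hid)
        if self.lstm.bidirectional:
            # Last layer's forward and backward states, concatenated
            hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        else:
            hidden = hidden[-1]
        return torch.sigmoid(self.fc(hidden))  # probability of "insincere"

model = Classifier(vocab_size=50, embedding_dim=8, hidden_dim=16, output_dim=1,
                   n_layers=2, bidirectional=True, dropout=0.2)
text = torch.randint(0, 50, (3, 6))  # 3 padded questions, 6 tokens each
lengths = torch.tensor([6, 4, 2])    # true length of each question
preds = model(text, lengths)
print(preds.shape)
```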

The next step would be to define the hyperparameters and instantiate the model.

Here is the code block for the same:

View the code on Gist.

Let us look at the model summary and initialize the embedding layer with the pretrained embeddings:

View the code on Gist.
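A minimal sketch of both steps is shown here on a bare embedding layer. The helper name `count_parameters` is hypothetical, and random vectors stand in for the pretrained embeddings that would normally come from the vocabulary.

```python
import torch
import torch.nn as nn

def count_parameters(model):
    """Number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-ins: a 20-word vocabulary with 5-dimensional vectors
pretrained_embeddings = torch.randn(20, 5)
embedding = nn.Embedding(num_embeddings=20, embedding_dim=5)

print(count_parameters(embedding))  # 20 * 5 weights

# Overwrite the randomly initialised weights with the pretrained vectors
with torch.no_grad():
    embedding.weight.copy_(pretrained_embeddings)
```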

Here I have defined the optimizer, loss and metric for the model:

View the code on Gist.
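One plausible setup for a binary classifier is sketched here. The choice of Adam and binary cross-entropy, the placeholder model and the helper name `binary_accuracy` are all assumptions of this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())  # placeholder model

optimizer = optim.Adam(model.parameters())  # Adam with its default learning rate
criterion = nn.BCELoss()                    # binary cross-entropy on probabilities

def binary_accuracy(preds, y):
    """Fraction of predictions that match the labels after rounding to 0 or 1."""
    correct = (torch.round(preds) == y).float()
    return correct.sum() / len(correct)

preds = torch.tensor([0.9, 0.2, 0.6, 0.4])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

print(binary_accuracy(preds, labels))  # 3 of the 4 predictions are correct
print(criterion(preds, labels))
```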

There are 2 phases while building the model:

- Training phase: model.train() puts the model in training mode and activates the dropout layers.
- Inference phase: model.eval() puts the model in evaluation mode and deactivates the dropout layers.
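The difference between the two modes is easy to see on a single dropout layer. This is a minimal sketch with an arbitrary dropout probability of 0.5.

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()         # training mode: roughly half the values are zeroed,
train_out = dropout(x)  # and the survivors are scaled by 1 / (1 - p) = 2

dropout.eval()          # evaluation mode: dropout does nothing
eval_out = dropout(x)

print((train_out == 0).sum().item())  # varies from run to run
print(torch.equal(eval_out, x))
```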

Here is the code block to define a function for training the model:

View the code on Gist.

So we have a function to train the model, but we will also need a function to evaluate the model.

Let’s do that:

View the code on Gist.
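The two functions could look roughly like the sketch below. The toy model, the random data and the function signatures are assumptions made to keep the example self-contained, not the original Gist.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train(model, batches, optimizer, criterion):
    model.train()  # enable dropout
    total_loss = 0.0
    for inputs, labels in batches:
        optimizer.zero_grad()  # clear gradients from the previous step
        preds = model(inputs).squeeze(1)
        loss = criterion(preds, labels)
        loss.backward()        # backpropagate
        optimizer.step()       # update the weights
        total_loss += loss.item()
    return total_loss / len(batches)

def evaluate(model, batches, criterion):
    model.eval()  # disable dropout
    total_loss = 0.0
    with torch.no_grad():  # no gradients are needed for evaluation
        for inputs, labels in batches:
            preds = model(inputs).squeeze(1)
            total_loss += criterion(preds, labels).item()
    return total_loss / len(batches)

# Toy data and model, just to exercise the two functions
torch.manual_seed(0)
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,)).float())
           for _ in range(5)]
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCELoss()

before = evaluate(model, batches, criterion)
for _ in range(20):
    train(model, batches, optimizer, criterion)
after = evaluate(model, batches, criterion)
print(before, after)
```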

Finally we will train the model for a certain number of epochs and save the best model every epoch.

View the code on Gist.
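Keeping only the best weights can be sketched as follows. The validation losses are made-up numbers standing in for real epochs, and the file name is hypothetical.

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 1)  # placeholder model
# Hypothetical file name inside the system's temporary directory
path = os.path.join(tempfile.gettempdir(), "saved_weights.pt")

# Made-up validation losses standing in for the result of each epoch
valid_losses = [0.70, 0.55, 0.60, 0.50, 0.52]

best_valid_loss = float("inf")
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:  # keep only the best checkpoint so far
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), path)
        print(f"epoch {epoch}: saved (valid loss {valid_loss:.2f})")

# Later: restore the best weights before running inference
model.load_state_dict(torch.load(path))
model.eval()
```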

Let us load the best model and define the inference function that accepts user-defined input and makes predictions:

View the code on Gist.

Amazing! Let us use this model to make predictions for a few questions:

View the code on Gist.

End Notes

We have seen how to build our own text classification model in PyTorch and learnt the importance of pack padding.

You can play around with the hyperparameters of the LSTM model and try to improve accuracy even further.

Some of the hyperparameters to tune can be the number of LSTM layers, number of hidden units in each LSTM cell and so on.

If you have any queries or feedback, leave them in the comments section.

I will get back to you.
