Transfer Learning in NLP for Tweet Stance Classification

A comparison of two transfer learning methods in Natural Language Processing: “ULMFiT” and the “OpenAI Transformer” for a multi-class classification task involving Twitter data

Prashanth Rao, Jan 15

Figure: A wordcloud of an interesting Tweet dataset analyzed in this post

2018 has been a hugely exciting year in the field of Natural Language Processing (NLP), in particular for transfer learning: a technique where, instead of training a model from scratch, we use models pre-trained on a large dataset and then fine-tune them for specific natural language tasks.

Sebastian Ruder provides an excellent account of the past and current state of transfer learning in his post “NLP’s ImageNet moment has arrived”, explaining why this is such a hot field in NLP right now — his post is a must read.

In recent times, methods such as ULMFiT, the OpenAI Transformer, ELMo and Google AI’s BERT have revolutionized the field of transfer learning in NLP by using language modelling during pre-training, which has significantly improved on the state-of-the-art for a variety of tasks in natural language understanding.

It can be argued that the use of language modelling (which is not without its limitations) is one of the main reasons computers have shown great improvements in their semantic understanding of language.

One interesting aspect of the transfer learning methods mentioned above is that they use language models pre-trained on well-formed, massive curated datasets that include full sentences with a clear syntax (such as Wikipedia articles and the 1 billion word benchmark).

The natural question that arises is: how well can such pre-trained language models generalize to natural language tasks from a different distribution, such as Tweets?

Goal of this post

In this post we will discuss and compare two modern transfer learning approaches, ULMFiT and the OpenAI Transformer, and show how they can be fine-tuned with relative ease to perform classification tasks from a different distribution: in this case, classifying the stance of Tweets towards a target topic.

We will aim to develop a modelling approach that can help answer the following questions:

- Does our fine-tuned language model (and classifier) generalize to the unstructured and messy language syntax of Tweets?
- Can we achieve reasonable accuracy (comparable to a benchmark result from 2016) with minimal task-specific customization for each model and with limited computing resources?
- How do the classification results vary based on the model architecture used?

All the code and data for the results shown below are available in this GitHub repo.

Why ULMFiT and the OpenAI Transformer?

These two particular transfer learning methods were chosen for this project because they are very similar in terms of how they use language modelling to perform unsupervised pre-training, followed by a supervised fine-tuning step (they are both semi-supervised methods).

But they are also different in that they use different network architectures to achieve generalization.

ULMFiT uses a 3-layer LSTM architecture (AWD-LSTM), while the OpenAI approach uses a transformer network.

A full description of ULMFiT and the OpenAI Transformer is too long a topic for this post, but this article does an excellent job of highlighting the technical details of both model architectures and why they matter, so please do read it!

Background

Before going into the details of transfer learning for Tweet stance classification, let’s clarify some terminology to understand why transfer learning has so drastically improved the state-of-the-art for a variety of natural language tasks in recent times.

Moving on from word embeddings

Historically, pre-trained word embedding techniques such as word2vec and GloVe were heavily used in NLP to initialize the first layer of a neural network before training for a new task.

These are shallow representations (a single layer of weights, known as embeddings).

Any prior knowledge from the word embeddings is only present in the first layer of the network, so the rest of the network would still need to be trained from scratch for a new target task.

To derive meaning from sequences of words (such as those seen in natural language), models that utilize word embeddings would still need tremendous amounts of data to disambiguate large sets of words and “learn” from a completely new and unseen vocabulary.

As shown in a benchmark result discussed later in this post, the amount of data required for transferring knowledge through word embeddings can be huge, which can result in very large computational costs.

Throughout 2018, the advent of powerful pre-trained language models has shown that it is possible to gain a much deeper understanding of language semantics and structure for new tasks, especially for long sequences, using the knowledge gained from pre-training on large text corpora.

Language Modelling

A language model attempts to learn the structure of natural language through hierarchical representations, and thus contains both low-level features (word representations) and high-level features (semantic meaning).

A key feature of language modelling is that it is generative, meaning that it aims to predict the next word given a previous sequence of words.

It is able to do this because language models are typically trained on very large datasets in an unsupervised manner, and hence the model can “learn” the syntactic features of language in a much deeper way than word embeddings.
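To make the "predict the next word" idea concrete, here is a deliberately tiny sketch: a bigram counter, which is far simpler than the neural language models discussed in this post. The corpus is made up for illustration, and the function names `train_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev_word, next_word in zip(tokens, tokens[1:]):
            following[prev_word][next_word] += 1
    return following

def predict_next(model, word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Made-up toy corpus, purely for illustration
corpus = [
    "climate change is real",
    "climate change is a concern",
    "change is coming",
]
model = train_bigram_model(corpus)
print(predict_next(model, "climate"))  # change
```

No labels are needed here: the "supervision" comes from the text itself, which is what makes pre-training on large unlabelled corpora possible.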

In his post, Sebastian Ruder does a very elegant job of highlighting why language modelling is so powerful for a broad range of NLP tasks.

Unlabelled language data is relatively easy to obtain (it is freely available in the form of large text corpora), so by feeding a language model a sufficiently large dataset, it is now possible to perform unsupervised pre-training on billions of words while incorporating a deeper knowledge of language syntax.

Semi-supervised learning

Transfer learning in NLP is now typically done as a multi-step process, where an entire network is first pre-trained in an unsupervised manner with a language modelling objective.

Following this, the model is then fine-tuned on a new task using a supervised approach (with some labelled data), which can then be used for tasks such as text classification.

This combination of unsupervised pre-training (using language modelling) followed by supervised fine-tuning is termed as semi-supervised learning, and is the approach used to solve our Tweet stance classification problem in this post.

Dataset Used

The Tweet dataset used in this post comes from this SemEval 2016 shared task, which contains Tweets that pertain to the following five topics:

- Atheism
- Climate change is a concern
- Feminist movement
- Hillary Clinton
- Legalization of abortion

The labelled data provided consists of a target topic, a Tweet that pertains to it, and the stance of the Tweet towards the target.

The data is already split into a training set (containing 2,914 Tweets) and a test set (containing 1,249 Tweets).

The stance can be one of three labels: “FAVOUR”, “AGAINST” and “NEITHER”, hence this is a multi-class dataset.

Figure: Example training data (randomly sampled)

The challenge of detecting stance

Stance detection is a subcategory of opinion mining, where the task is to automatically determine whether the author of a piece of text is in favour of or against a given target.

Consider the following two Tweets:

We don’t inherit the earth from our parents we borrow it from our children

Last time I checked, Al Gore is a politician, not a scientist.

To a human observer, it is reasonably clear that both Tweets are relevant to the topic of climate change, and that each expresses a particular stance towards it.

However, to a machine, detecting this stance is a difficult problem on multiple fronts.

The informal and unstructured syntax of Tweets combined with the fact that machines lack proper contextual awareness and historical knowledge that humans have (for example, knowing who Al Gore is), makes this a challenging problem for machine learning algorithms.

In addition, gathering large amounts of labelled data for Tweet stance in order to train a machine learning algorithm is expensive and tedious.

It is becoming more and more necessary to develop deep learning methods that can work with limited amounts of training data and still yield useful insights.

Stance and sentiment are not the same!

Stance detection is related to, but not the same as, sentiment analysis.

In sentiment analysis, we are interested in whether a piece of text is positive, negative, or neutral based on just the content of the language used.

Typically, for sentiment analysis, the choice of positive or negative language correlates with the overall sentiment of the text.

However, the stance of a piece of text is defined with respect to a target topic, and can be independent of whether positive or negative language was used.

The target (topic towards which opinion is expressed) may or may not be mentioned directly in the actual text, and any entities mentioned in the text may or may not be the actual target of opinion.

Below is an example from the task creators’ paper.

Topic: legalization of abortion
Tweet: The pregnant are more than walking incubators. They have rights too!
Stance: Favour

In the above example, since the topic is phrased as “legalization of abortion”, the Tweet can be interpreted as being in favour of the topic.

Had it been phrased as “Pro-life movement”, its stance would have been against the topic.

It is clear from this example that the language used in the Tweet is only loosely positive in its sentiment; however this sentiment has no bearing on whether it is in favour of, or against the topic.

Class Imbalance

The table below shows a breakdown of how many Tweets pertain to each topic in the dataset.

The 2,914 Tweets are distributed unequally per topic, and there is significant variation in the number of Tweets belonging to each class for each topic.

To explore the distribution and inspect the Tweets in more detail, take a look at the fully interactive visualization provided by the task creators.

Source: SemEval-2016 Stance Dataset

Looking at this distribution, the stance classification task appears quite challenging. Not only is the dataset small (a couple of thousand training samples in total, ranging from a minimum of 395 to a maximum of 689 Tweets per topic), but there is also a significant class imbalance in the samples.

For example, the topic “climate change is a concern” has a larger percentage of training samples classified as “favour”, and a very small percentage (less than 4%) classified as “against”.

On the other hand, the topic “atheism” has a much larger fraction of its samples classified as “against”.

Any modelling approach for stance classification must be able to capture this class imbalance, both between and within the target classes.
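A quick way to inspect this kind of imbalance is to compute the fraction of each stance per topic. The sketch below uses a handful of made-up rows (not the real SemEval counts), and `stance_distribution` is a hypothetical helper name.

```python
from collections import Counter

def stance_distribution(rows):
    """Return {topic: {stance: fraction}} for an iterable of (topic, stance) pairs."""
    counts = {}
    for topic, stance in rows:
        counts.setdefault(topic, Counter())[stance] += 1
    return {
        topic: {s: n / sum(c.values()) for s, n in c.items()}
        for topic, c in counts.items()
    }

# Made-up rows for illustration only, NOT the real dataset counts
rows = [
    ("Atheism", "AGAINST"), ("Atheism", "AGAINST"), ("Atheism", "FAVOUR"),
    ("Climate", "FAVOUR"), ("Climate", "FAVOUR"), ("Climate", "FAVOUR"),
    ("Climate", "NEITHER"),
]
dist = stance_distribution(rows)
print(dist["Climate"])
```

With the real training DataFrame, the same pairs could be produced from its topic and stance columns.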

Evaluation Metric Used

To evaluate classification performance, the creators of the task use a macro-averaged F-score: the F-score (the harmonic mean of precision and recall) is computed separately for the two main classes “FAVOUR” and “AGAINST”, and the two values are then averaged.

An example of how precision and recall are used in sentiment classification is given here.

In general, precision is at odds with recall, and hence the F-score provides a good way to gain insights into a classifier’s performance.

Although we do not include the third class “NEITHER” in the evaluation, it is implicitly accounted for, since the system has to correctly predict all three classes to avoid being penalized heavily in either of the first two.

To evaluate the F-score, a perl script (whose usage will be described in later sections) was provided by the task creators. All we need to do is shape our stance classifier’s prediction output in a way that can be read by the evaluation script.

The macro-averaged F-score obtained can then be compared with other models’ results.
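As a rough illustration of the metric (not a replacement for the official perl script), the following sketch averages the per-class F-scores of the two main classes. The label strings follow the spelling used in this post and the example predictions are made up.

```python
def f_score(gold, predicted, label):
    """F-score for one label from parallel lists of gold and predicted stances."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f_favour_against(gold, predicted):
    """Average of the FAVOUR and AGAINST F-scores; NEITHER is not averaged in."""
    return (f_score(gold, predicted, "FAVOUR")
            + f_score(gold, predicted, "AGAINST")) / 2

# Made-up example labels
gold      = ["FAVOUR", "FAVOUR",  "AGAINST", "NEITHER", "AGAINST"]
predicted = ["FAVOUR", "NEITHER", "AGAINST", "FAVOUR",  "AGAINST"]
print(macro_f_favour_against(gold, predicted))  # 0.75
```

Note how the mistaken "NEITHER" predictions still lower the score: they show up as a false negative and a false positive for "FAVOUR".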

Benchmark Result for Comparison: MITRE

The winning entry for this task in 2016 was from team MITRE, who describe their classification approach in this paper.

To detect stance, MITRE used a Recurrent Neural Network (RNN) model organized into 4 layers as shown in the below image.

The first layer contained one-hot-encoded tokens (i.e. words from the text) that were projected through a 256-dimensional embedding layer called the “projection layer”.

The sequence of outputs was then fed into a “recurrent layer” (containing 128 LSTM units), whose output was then connected to a 128-dimensional layer of Rectified Linear Units (ReLUs) with 90% dropout.

The final output layer was a 3-dimensional softmax layer representing each output class: FAVOUR, AGAINST and NONE.

Image credit: MITRE’s submission to SemEval 2016

MITRE also applied transfer learning (via the use of word embeddings) to reuse prior knowledge of Tweet syntax from a pre-trained model.

This was accomplished through a multi-step pre-training process as described below:

Pre-training the projection layer: The weights for the projection layer were initialized from 256-dimensional word embeddings learned using the word2vec skip-gram method.

To do this, MITRE extracted 218,179,858 English Tweets from Twitter’s public streaming API and then performed weakly supervised learning on this unlabelled dataset (after cleaning the Tweets and lower-casing them).

To learn the meaning of compound phrases, they then applied word2phrase to identify phrases comprised of up to four words.

Out of the 218 million Tweets sampled, 537,366 vocabulary items were used (that appeared at least 100 times in the corpus).

Pre-training the recurrent layer: The second layer of MITRE’s network, which consisted of 128 LSTM units, was initialized with weights pre-trained using distant supervision of a hashtag prediction auxiliary task.

To begin, 197 hashtags were automatically identified that were relevant to the five topics under consideration (using a nearest-neighbour search of the word embedding space).

Then, 298,973 Tweets (out of the total 218 million) were extracted that contained at least one of these 197 hashtags, and the network was trained to tune the word embeddings and the recurrent layer.

Like before, the Tweets were lower-cased, stripped of all hashtags and phrase-chunked before tokenization.

Results: MITRE

The F-scores on unseen test data for each topic and class obtained by MITRE are shown below.

Note that for the topic class “Climate change (AGAINST)”, the F-score is zero, meaning that their model did not predict any Tweets from the test set as being against climate change.

However, just 3.8% of the training data from the climate change topic was of the class “AGAINST” (which amounts to just 15 training samples!), so it makes sense that the model might be lacking context for this particular class due to the sheer lack of training samples.

Data source: MITRE’s paper submitted to SemEval 2016

As mentioned in their paper, MITRE obtained a macro F-score (averaged across all topics) of 0.


This was the best score among all 19 participants who submitted their results to the competition in 2016.

The approach used by MITRE was a rather elaborate multi-step pre-training procedure using word embeddings that required the use of very large unlabelled datasets (hundreds of millions of samples), significant cleaning of the raw data and separate pre-training steps for each layer in the network.

The multitude of hand-crafted steps was mainly because of the limitations of word2vec embeddings (which was the dominant approach to pre-training models in NLP at the time).

Method 1: ULMFiT

ULMFiT has been entirely implemented in v1 of the fastai library (see fastai.text on their GitHub repo).

Version 1 of fastai is built on top of PyTorch v1, so having some knowledge of PyTorch objects is beneficial to get started.

In this post, we cover some of the techniques that fastai has developed that make it very convenient to do transfer learning, even for someone with relatively little deep learning experience.

Training Steps

As described in the original paper, ULMFiT consists of three stages.

1. Training the language model on a general-domain corpus that captures high-level natural language features
2. Fine-tuning the pre-trained language model on target task data
3. Fine-tuning the classifier on target task data

Image source: ULMFiT paper by Jeremy Howard and Sebastian Ruder

We only perform steps 2 and 3 during Tweet stance classification.

Step 1 is an unsupervised pre-training step, and is really computationally expensive — which is why the models have been made available publicly by fastai so others can benefit from their work.

We bank on the pre-trained language model’s ability to capture long-term dependencies in any target text (in English) that we might encounter.

All the code for this section is available in a Jupyter notebook [ulmfit.ipynb].
For brevity, only the key elements of the approach are discussed in this post — feel free to look through the full notebook and this project’s main GitHub repo for a deep-dive into the working code for classifying Tweet stance.

Novel learning techniques in ULMFiT

The following novel techniques from the ULMFiT paper are what allow it to generalize well even on unseen data from a different distribution.

It is recommended to read the full paper for a deeper understanding, but a summary is given below.

Discriminative Fine-tuning: Each layer of the model captures different types of information.

Hence, it makes sense to fine-tune each layer’s learning rates differently, and this is done in ULMFiT based on extensive empirical testing and implementation updates.

It was empirically found that first fine-tuning only the last layer (with the others frozen), and then unfreezing all the layers and lowering the learning rate by a factor of 2.6 for each successively lower layer during language model fine-tuning worked well in most cases.
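The per-layer rates can be sketched in a few lines. This assumes, as in the ULMFiT paper, that each layer's rate is the rate of the layer above it divided by 2.6; the function name and the choice of four layers are hypothetical.

```python
def discriminative_learning_rates(top_lr, n_layers, factor=2.6):
    """Learning rates ordered from the lowest layer (index 0) to the top layer."""
    return [top_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]

# Assumed example: 4 layer groups with a top-layer rate of 1e-3
rates = discriminative_learning_rates(top_lr=1e-3, n_layers=4)
for depth, lr in enumerate(rates):
    print(f"layer group {depth}: lr = {lr:.2e}")
```

In fastai this is usually expressed by passing a `slice` of learning rates to the fit call, as the classifier training code later in this post does, rather than by building the list by hand.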

1-cycle learning rate policy: In the fine-tuning stage, we apply 1-cycle learning rates, which comes from this report by Leslie Smith.

It is a modification of the cyclical learning rate policy, which has been around for a long time, but the 1-cycle policy allows a larger initial learning rate (say max_LR = 1e-03), but decreases it by several orders of magnitude just at the last epoch.

This seems to provide greater final accuracy.

Note that this doesn’t mean we run it for one epoch: the ‘1’ in 1-cycle refers to a single learning-rate cycle (a rise to the maximum followed by a decay) that spans all of the epochs that we specify.

In the ULMFiT implementation, this 1-cycle policy has been tweaked and is referred to as slanted triangular learning rates.
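For reference, the slanted triangular schedule as defined in the ULMFiT paper can be sketched as below: a short linear warm-up followed by a long linear decay. The defaults (`cut_frac=0.1`, `ratio=32`, `max_lr=0.01`) are the values suggested in the paper, not the ones used for the results in this post, and the sketch is not fastai's own implementation.

```python
import math

def slanted_triangular_lr(t, total_steps, max_lr=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t (0-based) out of total_steps.

    Assumes total_steps * cut_frac >= 1 so that the warm-up is at least one step.
    """
    cut = math.floor(total_steps * cut_frac)  # step at which the rate peaks
    if t < cut:
        p = t / cut                            # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay
    return max_lr * (1 + p * (ratio - 1)) / ratio

schedule = [slanted_triangular_lr(t, total_steps=100) for t in range(100)]
print(schedule[0], schedule[10], schedule[99])
```

The rate starts at `max_lr / ratio`, peaks at `max_lr` after 10% of the steps, and then decays back towards its starting value.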

Gradual unfreezing: During classification, rather than fine-tuning all the layers at once, the layers are “frozen” and the last layer is fine-tuned first, followed by the next layer before it, and so on.

This avoids the phenomenon known as catastrophic forgetting (the loss of the prior knowledge gained from the language model).
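The order in which layers become trainable can be sketched without any deep learning library. The layer-group names below are hypothetical placeholders, and each stage would correspond to one or more epochs of training.

```python
def gradual_unfreezing_schedule(layer_groups):
    """Yield, for each stage, the layer groups that are trainable at that stage."""
    for n_unfrozen in range(1, len(layer_groups) + 1):
        yield layer_groups[-n_unfrozen:]

# Hypothetical layer-group names, ordered from input to output
groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]
stages = list(gradual_unfreezing_schedule(groups))
for stage, trainable in enumerate(stages, start=1):
    print(f"stage {stage}: train {trainable}")
```

In fastai the same effect is obtained by alternating calls that freeze all but the last few layer groups with short training runs, ending with everything unfrozen.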

Concatenated pooling: Pooling is a component of neural networks to aggregate the learned features and reduce the overall computational burden of a large network.

In case you’re curious, a good introduction to pooling as applied in LSTMs is given in this paper.

In ULMFiT, because an input text can consist of hundreds or thousands of words, information might get lost if we only consider the last hidden state in the LSTM.

To avoid this information loss, the hidden state at the last time step is concatenated with both the max-pooled and mean-pooled representation of the hidden states over as many time steps as can fit in GPU memory.
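A minimal sketch of concatenated pooling, written with plain Python lists so that the arithmetic is easy to follow; a real implementation would operate on batched tensors. The function name `concat_pool` and the toy hidden states are assumptions for illustration.

```python
def concat_pool(hidden_states):
    """Concatenate the last hidden state with max- and mean-pooled states.

    hidden_states: list of T vectors (lists of floats), one per time step.
    Returns a single vector three times the hidden size.
    """
    last = hidden_states[-1]
    dims = range(len(last))
    max_pool = [max(h[d] for h in hidden_states) for d in dims]
    mean_pool = [sum(h[d] for h in hidden_states) / len(hidden_states) for d in dims]
    return last + max_pool + mean_pool

# Toy example: 3 time steps, hidden size 2
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
pooled = concat_pool(states)
print(pooled)
```

The classifier head then sees a vector of size 3 x hidden size, so a strong signal from early in the text is not lost just because it is far from the final time step.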

ULMFiT’s language model

ULMFiT’s pre-trained language model was trained on the Wikitext 103 dataset by Stephen Merity.

fast.ai provides an API where this pre-trained model (along with some standard datasets for testing) can be conveniently and easily loaded for any target task before fine-tuning.

The main thing to note about Wikitext 103 is that it consists of a pre-processed subset of 103 million tokens extracted from Wikipedia.

The dataset retains the original case (it was not lower-cased before training the language model), and all punctuation and numbers are included.

The type of text data included in this dataset includes sentences from full Wikipedia articles, so the hope is that the language model is able to capture and retain some longer-term dependencies from relatively complex sentences.

Pre-process Tweet data for ULMFiT

The original Tweets can contain some arbitrary non-English characters, so we take care of this before loading in the language model.

Note that this only removes non-English characters, but does not do any other kind of aggressive pre-processing (like lower-casing or removing entire words or hashtags).

We retain the entirety of the information in the raw data and let the language model do the heavy lifting.

The ULMFiT framework implemented in fast.ai works very well with Pandas DataFrames, so all data is read in using pandas.read_csv and stored as a DataFrame.

    import pandas as pd

    # The SemEval files are tab-separated
    train_orig = pd.read_csv(path/trainfile, delimiter='\t', header=0, encoding="latin-1")

We then make sure that the Tweets only contain ASCII characters by applying a simple cleanup function.

    def clean_ascii(text):
        # function to remove non-ASCII chars from data
        return ''.join(i for i in text if ord(i) < 128)

    train_orig['Tweet'] = train_orig['Tweet'].apply(clean_ascii)

Figure: Training data stored in a Pandas DataFrame

ULMFiT requires just the stance and the text data (i.e. Tweets) for the language-model fine-tuning and classification steps, hence we store these in the relevant DataFrame and write the clean data out to a csv file.

    train = pd.concat([train_orig['Stance'], train_orig['Tweet']], axis=1)
    # Write train to csv
    train.to_csv(path/'train.csv', index=False, header=False)

Figure: Clean training data ready to be loaded into ULMFiT

Language model fine-tuning in ULMFiT

This is the first stage of training, where we use the pre-trained language model weights (Wikitext 103) and fine-tune them with just the provided training data of 2,914 Tweets.

The published Jupyter notebook (ulmfit.ipynb) describes in detail the pre-processing that is performed under the hood by fastai when using ULMFiT.

The reader is directed to look at the Jupyter notebook for more code and API details.

In our case, we specify a minimum word frequency of 1 for our language model fine-tuning step, which tells ULMFiT to only tokenize words in the Tweets that appear more than once with a unique token — all the words that appear once are given the tag <unk> during tokenization.

For a very detailed historical account of how and why tokenization is done this way, this fastai course documentation page contains some very useful information.

    data_lm = TextLMDataBunch.from_csv(path, 'train.csv', min_freq=1)

Figure: Tokenized and tagged Tweets using ULMFiT + spaCy

On viewing the tokenized Tweets, we can see that they look markedly different from their original form.

The tokenization technique used by fastai.text is quite advanced and was obtained after months of development by Jeremy Howard and the fastai team, and thus uses quite a few tricks to capture semantic meaning from the text.

Note that we are not converting the text to lowercase and removing stopwords (which was a common pre-tokenization approach in NLP until recently) — this would result in a tremendous loss of information that the model could instead use to gather an understanding of the new task's vocabulary.

Instead, a number of added tags are applied to each word as shown above so that minimal information is lost.

All punctuation, hashtags and special characters are also retained.

For example, the xxmaj token indicates that there is capitalization of the word.

"The" will be tokenized as "xxmaj the".

Words that are fully capitalized, such as "I AM SHOUTING", are tokenized as "xxup i xxup am xxup shouting".
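The two capitalization rules just described can be mimicked with a few lines of Python. This is a simplified illustration only: `tag_capitalization` is a hypothetical helper, and fastai's actual rules handle more cases (and edge cases) than this sketch does.

```python
def tag_capitalization(tokens):
    """Lower-case each token and prepend a tag that records its original casing."""
    tagged = []
    for tok in tokens:
        if tok.isupper():
            tagged += ["xxup", tok.lower()]   # fully capitalized word
        elif tok[:1].isupper():
            tagged += ["xxmaj", tok.lower()]  # only the first letter capitalized
        else:
            tagged.append(tok)                # nothing to record
    return tagged

print(" ".join(tag_capitalization("I AM SHOUTING".split())))
print(" ".join(tag_capitalization("The earth is warming".split())))
```

Because the casing is moved into a separate token, "The" and "the" share one vocabulary entry while the model can still learn from the capitalization.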

The method still uses spaCy's underlying tokenizer (including a multi-thread wrapper around spaCy to speed things up), but adds tags in a very smart way.

This balances capturing semantic meaning with reducing the number of overall tokens — so it is both powerful and efficient.

For a full list of all the token tags generated by ULMFiT’s fastai implementation, see the source code here.

Find the optimum learning rate: We define a learner object that uses the tokenized language model data, organized into batches for the GPU, and feed it a pre-trained language model as follows.

fastai.train provides a convenient utility to search through a range of learning rates to find the optimum one for our dataset.

The idea is that our optimization function needs to use a learning rate that is at least an order of magnitude below the point at which the loss starts to diverge.

    learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.5)
    learn.lr_find(start_lr=1e-8, end_lr=1e2)
    learn.recorder.plot()

Figure: Result of the “learning rate finder” (lr_find) as implemented in fastai.train

Applying discriminative fine-tuning as per the ULMFiT paper, we run the language model fine-tuning step until the validation loss drops to a low value (close to 0.


    learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.5)
    # Run one epoch with lower layers frozen
    learn.fit_one_cycle(cyc_len=1, max_lr=1e-3, moms=(0.8, 0.7))
    # Run for many epochs with all layers unfrozen
    learn.unfreeze()
    learn.fit_one_cycle(cyc_len=20, max_lr=1e-3, moms=(0.8, 0.7))

This fine-tuned encoder layer’s weights are then saved for use during the classification stage.

    # Save the fine-tuned encoder
    learn.save_encoder('ft_enc')

Fine-tuning the classifier

This step involves creating a classifier object that can predict a class label once we re-train the model as a classifier.

The same network structure is still used for this task — the output layer is defined in a way that takes into account the number of classes we want to predict in our data.

Find optimum learning rate for the classifier: Just as before, the lr_find method is run to find an optimum learning rate for the classifier.

# Classifier databunchdata_clas = TextClasDataBunch.

from_csv(path, 'train_topic.

csv', vocab=data_lm.


vocab, min_freq=1, bs=32)# Classifier learner learn = text_classifier_learner(data_clas, drop_mult=0.




lr_find(start_lr=1e-8, end_lr=1e2)learn.


plot()

Figure: Result of the “learning rate finder” (lr_find) as implemented in fastai.train

Carefully train the classifier: During classification, we first define a classifier learner object, and gradually unfreeze layers while running for one epoch each time as per the ULMFiT paper’s suggestion.

This helps us obtain a better classification accuracy than if we were to aggressively train all the layers at once.

learn = text_classifier_learner(data_clas, drop_mult=0.




fit_one_cycle(cyc_len=1, max_lr=1e-3, moms=(0.

8, 0.



fit_one_cycle(1, slice(1e-4,1e-2), moms=(0.




fit_one_cycle(1, slice(1e-5,5e-3), moms=(0.




fit_one_cycle(4, slice(1e-5,1e-3), moms=(0.


7))

The validation loss for the classifier is much higher than for the language model, which could possibly be linked with the dataset.

Since there is significant class imbalance, a significant number of labels during training are expected to be predicted incorrectly.

Specific to this Twitter dataset, this problem could be more severe than what we see in the state-of-the-art examples (such as IMDb).

Predict stance using the classifier

We read in the test set and store it in a Pandas DataFrame as shown.

    # The SemEval files are tab-separated
    test = pd.read_csv(path/testfile, delimiter='\t', header=0, encoding="latin-1")
    test = test.drop(['ID'], axis=1)

We can then apply our classifier learner’s predict method to predict the stance.

    test_pred['Stance'] = test_pred['Tweet'].apply(lambda row: str(learn.predict(row)[0]))
    # Output to a text file for comparison with the gold reference
    test_pred.

    txt', sep='\t', index=True, header=['Target', 'Tweet', 'Stance'], index_label='ID')

Evaluate our prediction from ULMFiT

The perl script provided by the task creators is run on the output and gold reference file (also provided by the creators) to produce a macro F-score that we can then compare with the benchmark result from MITRE.

    cd eval/
    perl eval.pl gold.txt predicted.txt

Figure: Best obtained result using ULMFiT

Best model parameters using ULMFiT

The best average macro F-score of 0.65 using ULMFiT across all topics was obtained using the below approach and parameters:

Fine-tuning the language model on an augmented Twitter vocabulary: this was done by downloading the Twitter sentiment140 dataset from Kaggle, and feeding a subset of its vocabulary (200,000 words) to the language model fine-tuning step.

The steps for this are shown in detail in the Jupyter notebook ulmfit.ipynb.
The full Kaggle Twitter dataset has 1.6 million Tweets, which would have taken several hours to fine-tune the language model on (even on an NVIDIA P100 GPU), so only a subset of 200,000 words was used to augment the language model fine-tuning step.

Training 5 distinct classifiers (i.e. a separate training task for each topic during classification) and then combining the outputs for comparison with the gold reference: this was a technique similar to that used by MITRE in their best result from 2016, and is explained in their paper.

An optimum learning rate of 1e-03 for the language model fine-tuning step

An optimum learning rate in the range of 1e-05 to 1e-03 for the classifier, with gradual unfreezing

Method 2: OpenAI Transformer

The OpenAI transformer, as described in their paper, is an adaptation of the well-known transformer from Google Brain’s 2017 paper “Attention is All You Need”.

Image credit: Google Brain’s “Attention is All You Need” paper

While the original version from Google Brain used an encoder-decoder architecture with stacks of 6 identical layers each, the OpenAI transformer uses a 12-layer decoder-only stack.

Each layer has two sub-layers, consisting of a multi-head self-attention mechanism, and a fully connected (position-wise) feed-forward network.

A full description of the transformer architecture used by OpenAI for transfer learning is given in the paper.

All the code for this section is available in a Jupyter notebook [transformer.


Just as before, only the key elements of the model are discussed in this post for brevity — feel free to look through the full notebook and this project’s main GitHub repo for a deep-dive into the working code for classifying the Tweets.

Training Steps

The following steps are used to train the OpenAI transformer:

Unsupervised pre-training: The transformer language model was trained in an unsupervised manner on the BooksCorpus dataset (several thousand unpublished books), and the pre-trained weights are made publicly available on the OpenAI GitHub repo for others’ benefit.

Supervised fine-tuning: We can adapt the parameters to the supervised target task.

The inputs are passed through the pre-trained model to obtain the final transformer block’s activation.

The first step (unsupervised pre-training) is very expensive, and was done by OpenAI (who trained the model for a month on 8 GPUs!) — thankfully, we can use the downloaded pre-trained model weights and proceed directly to the supervised fine-tuning step.

PyTorch Implementation of the OpenAI Transformer

NOTE: The original OpenAI transformer was implemented in TensorFlow; for this project we used this PyTorch port of the OpenAI transformer, thanks to the amazing work by the kind folks at HuggingFace.

This is so that we can make a more consistent comparison with the fastai implementation of ULMFiT (which is also PyTorch-based), not to mention the ease of maintaining and distributing the code using one single framework.

Novel Techniques in the OpenAI Transformer

To perform out-of-domain target tasks, the transformer includes language modelling as an additional objective to the fine-tuning, which helps generalized learning as described in their paper.

This auxiliary language modelling objective is added to the task-specific objective with a weighting parameter (lambda).

In the paper's notation, L1, L2 and L3 are the likelihoods for the language modelling objective, the task-specific objective and the combined objective respectively.
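For reference, the combined objective from the OpenAI paper can be written as below. The symbol C (the labelled fine-tuning dataset) follows the paper's notation and does not appear elsewhere in this post; lambda is the weighting parameter.

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```

Setting lambda to zero removes the language modelling term and leaves only the task-specific objective.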

The transformer has proven to be a very powerful language modelling tool, especially in machine translation, thanks to its “self-attention mechanisms”.

A very intuitive, visual explanation of how the transformer is suited to capturing longer-range linguistic structure (using masked self-attention) is given in this article.

A key point to note is that the OpenAI transformer used a 12-layer decoder-only structure, with 12 attention heads and a 768-dimensional state.

Another key point is that the model was trained on contiguous sequences of up to 512 tokens, which, according to the OpenAI authors, allows the transformer to capture much longer-range context than LSTM-based approaches.
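To make the idea of masked self-attention a little more concrete, here is a minimal sketch in plain Python. It is not code from the OpenAI or HuggingFace repositories; the function name `causal_attention_weights` and the toy score matrix are made up for illustration. It only shows the masking step: each token's attention weights are a softmax over itself and earlier tokens, so later positions get zero weight.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                  # token i may only attend to tokens 0..i
        peak = max(visible)                     # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Toy 3-token example: raw (unnormalized) attention scores, invented for illustration.
scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [2.0, 2.0, 2.0]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Note how the large scores in the upper-right corner of the toy matrix have no effect, because they belong to positions that come after the token doing the attending.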

Task-specific input transformations:

OpenAI designed their transformer to be task-agnostic and able to generalize to a range of natural language tasks.

To accomplish this, they allow the definition of custom “task-specific heads” as per the above image.

The task-specific head acts on top of the base transformer language model, and is defined in the DoubleHeadModel class in model_pytorch.py (see the GitHub repo).

The PyTorch port of the OpenAI transformer originally written by HuggingFace was for a multiple choice classification problem (ROCStories).

For this Tweet stance detection task, we follow the guidelines in the OpenAI paper to write a custom input transform for a classification task-head: every text (each Tweet, in our case) is prefixed with a start symbol and tokenized for input to the model.

In this classification input transform, all we need to do is specify the start token and then append the text to it.

We maintain the same tensor dimensionalities for each of the other variables as the original PyTorch code.
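A rough sketch of what such a classification input transform might look like is given below, in plain Python rather than NumPy so that it runs without dependencies. This is not the project's actual function: the name `transform_classification`, its parameters, the trailing classification token and the convention that position ids start right after the vocabulary are all assumptions, modelled loosely on how the PyTorch port lays out its ROCStories inputs.

```python
def transform_classification(tweets, start_id, clf_id, n_ctx, n_vocab_total):
    """Wrap each tokenized Tweet as [start] + tokens + [classify], padded to n_ctx.

    Returns (token_ids, position_ids, mask); each is a list of rows of length n_ctx.
    All names and the id layout are illustrative assumptions.
    """
    max_len = n_ctx - 2                         # leave room for the two special tokens
    token_ids, position_ids, mask = [], [], []
    for tokens in tweets:
        seq = [start_id] + tokens[:max_len] + [clf_id]
        pad = n_ctx - len(seq)
        token_ids.append(seq + [0] * pad)       # 0 is a padding id, hidden by the mask
        mask.append([1.0] * len(seq) + [0.0] * pad)
        # Assumed convention: position ids are numbered after the token vocabulary.
        position_ids.append(list(range(n_vocab_total, n_vocab_total + n_ctx)))
    return token_ids, position_ids, mask

# Two already-tokenized "Tweets" with made-up token ids.
ids, pos, mask = transform_classification(
    [[11, 12, 13], [21]], start_id=100, clf_id=101, n_ctx=6, n_vocab_total=102)
print(ids)
print(mask)
```

The mask marks which positions hold real tokens, so that padded positions can be ignored when the losses are computed.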

Pre-process Tweet data for the transformer

The dataloader for the training script works as follows.

Just as in the case of ULMFiT, we clean the data to remove any non-ASCII characters (to avoid issues during the encoding step).

The data is then stored as NumPy arrays.

To feed the data to the classification transform, we split it into training, validation and test sets using scikit-learn’s train_test_split utility.

We store each training Tweet (trX) alongside its numericalized stance (trY).

We feed the data to the PyTorch dataloader in train_stance.py as NumPy arrays (not DataFrame columns).
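The cleaning and splitting steps described above can be sketched as follows. This is a dependency-free illustration, not the project's dataloader: `strip_non_ascii` and `split_data` are hypothetical helpers, and `split_data` is a simple stand-in for scikit-learn's train_test_split that returns its results in the same order. The Tweets and labels are placeholders, not real data.

```python
import random

def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")

def split_data(texts, labels, held_out_fraction=0.2, seed=42):
    """Shuffle, then return (train_x, held_out_x, train_y, held_out_y)."""
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)          # seeded, so the split is reproducible
    n_held_out = max(1, int(len(order) * held_out_fraction))
    held_out, train = order[:n_held_out], order[n_held_out:]
    return ([texts[i] for i in train], [texts[i] for i in held_out],
            [labels[i] for i in train], [labels[i] for i in held_out])

# Placeholder Tweets; the first contains a non-ASCII character to be removed.
raw = ["caf\u00e9 tweet #SemST"] + [f"tweet {i} #SemST" for i in range(9)]
tweets = [strip_non_ascii(t) for t in raw]
stances = [i % 3 for i in range(10)]            # placeholder stance ids 0, 1, 2

trX, vaX, trY, vaY = split_data(tweets, stances)
print(len(trX), len(vaX))
```

Calling the same helper a second time on the held-out portion would yield separate validation and test sets.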

Below is an example of the input data once it is shaped as required by the transformer.

The stance “0” corresponds to “AGAINST”, “1” to “FAVOUR” and “2” to “NEITHER”.

Just as in ULMFiT, no information from the raw data is removed; we rely on the language model objective to identify syntactic relationships between unseen words (once the model fine-tunes).

Tweet | Stance
Plenty of stem cells without baby smashing, by the way. #SemST | 0
Is there a breeze i can catch Lakefront or will I die of a heat stroke there as well? #heatstroke #SemST | 1
Road to #Paris2015 "#ADP2015 co-chairs' new tool will be presented on July 24th" @manupulgarvidal at @UN_PGA event on #SemST | 2
Are the same people who are red faced and frothing over abortion also against the death penalty? Just wondering. #deathpenalty #SemST | 1
DID YOU KNOW: The 2nd Amendment is in place 'cause politicians ignore the #CONSTITUTION. #PJNET #SOT #tcot #bcot #ccot #2ndAmendment #SemST | 0
I Appreciate almighty God for waking me up diz beautiful day + giving me brilliant ideas to grow my #Hustle #SemST | 0
Speaking from the heart is rarely wise especially if you have a retarded heart like most feminists. #GamerGate #SemST | 0
@Shy_Buffy welcome little sister! Love you! #SemST | 2
@GregAbbott_TX which god? Yours? not mine. oh wait i don't have one. #LoveWins #SemST | 1

Fine-tune the language model and classifier

In the OpenAI transformer, the language model and classifier are fine-tuned simultaneously, and its parallelized architecture using multi-headed attention keeps training fast.

This makes it very easy to run the training loop and hence a number of experiments were possible.

The command below was used to run the training loop for 3 epochs, which is what was used for all our experiments.

See the Jupyter notebook (transformer.ipynb) and the file train_stance.py for more details on the default arguments and the various experiments run.

python3 train_stance.py --dataset stance --desc stance --submit --data_dir ./data --submission_dir default --n_iter 3

The output of the training script is fed to another script, parse_output.py, which shapes the output so that it can be evaluated by the Perl script provided by the task creators.

Evaluate our transformer’s prediction

The evaluation Perl script is run on the output and the gold reference file (also provided by the creators) to produce a macro F-score that we can then compare with the benchmark result from MITRE.
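For readers who want to sanity-check the metric without Perl, the macro F-score for this task can be approximated in a few lines of Python. The sketch below assumes the task's convention of averaging the F-scores of only the “FAVOUR” and “AGAINST” classes; the function names and the toy labels are made up for illustration, and the official script remains the reference for reported numbers.

```python
def f_score(gold, pred, label):
    """F1 for a single class; 0.0 when the class is never predicted correctly."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(1 for p in pred if p == label)
    recall = true_pos / sum(1 for g in gold if g == label)
    return 2 * precision * recall / (precision + recall)

def stance_macro_f(gold, pred, labels=("FAVOUR", "AGAINST")):
    """Average the per-class F-scores over the given labels (NEITHER is left out)."""
    return sum(f_score(gold, pred, label) for label in labels) / len(labels)

# Toy labels, invented for illustration only.
gold = ["FAVOUR", "FAVOUR", "AGAINST", "AGAINST", "NEITHER"]
pred = ["FAVOUR", "AGAINST", "AGAINST", "AGAINST", "NEITHER"]
print(round(stance_macro_f(gold, pred), 4))
```

Although the “NEITHER” class is not averaged in, wrongly predicting it (or failing to predict it) still changes the precision and recall of the other two classes.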

cd eval/
perl eval.pl gold.txt predicted.txt

Best obtained result using the OpenAI Transformer

Best model parameters using the OpenAI Transformer

The best macro F-score of 0.69 from the transformer across all topics was obtained using the approach and parameters below:

- Fine-tuning the language model on just the provided training Tweets (2,914 of them). The transformer was able to quickly generalize to the Tweet data in 3 epochs, and obtained a good macro F-score without having to augment the input data in any way.

- Training one single classifier for all topics at once (i.e. running the training loop on the entire training dataset). When we tried to train the transformer on just a single topic (which had fewer than 500 training samples), there was significant over-fitting, with the validation accuracy dropping to well below 70%. This could be because the transformer has a high-dimensional embedding layer (768 dimensions) that requires a sufficient amount of training data to avoid over-fitting.

- A language modelling weighting parameter (lambda) of 0.5, as per the OpenAI paper.

- Dropouts of 0.1 on all layers (including the classification layer), once again as per the OpenAI paper.

Overall, tuning the dropout, the weight of the language modelling objective and some of the other default arguments (such as the random seed) did little to improve the macro F-score.

In general, the transformer was very quick to produce good results and required only some basic customization of the task-head input transforms.

Analysis of Results

In this section we compare the F-scores obtained by our two approaches, ULMFiT and the transformer, with the benchmark result by MITRE.

Overall F-score

When we consider the macro F-scores for stance across all five topics at once, the OpenAI transformer produces the best result.

Best results compared with the benchmark from MITRE

What is remarkable about this result is that both ULMFiT and the OpenAI transformer use language models pre-trained on a very different distribution (Wikipedia and BooksCorpus respectively), whereas MITRE used embeddings pre-trained on a massive Twitter dataset (similar distribution).

It is clear that utilizing language modelling as a training signal during pre-training as well as fine-tuning has its advantages.

Even though Tweets are far more informal in their syntax and have shorter sequence lengths than the average sentence from a book corpus, the pre-trained models in either case are able to generalize and make predictions based on some syntactic understanding of the target data.

Topic-wise F-score

The image below compares the F-score (FAVOUR) and F-score (AGAINST) from our two approaches with MITRE’s, this time on a per-topic basis.

Looking at these results, it is once more clear that the OpenAI transformer outperforms ULMFiT on most topics, for either class, which explains the higher overall F-score.

Comparison of topic-wise F-scores from ULMFiT and the OpenAI transformer vs. MITRE’s

The transformer does really well on all topics, regardless of whether it is predicting the “FAVOUR” or “AGAINST” class.

One very interesting observation is that both ULMFiT and MITRE produce zero predictions for “Climate change is a concern (AGAINST)” whereas the transformer is able to make some correct predictions for this case.

Inspecting the training data further (shown earlier in this post), we find that this is the same class for which we have very few training samples (just 15 of them!).

It is remarkable that with so few training examples, the transformer is able to achieve some context and still assign some predictions to this class.

Judging by its good performance in the many other widely varying classes, this doesn’t seem like a fluke!

Cross-tabulated Results: Ours vs. Gold Reference

We count the total predictions made by both models in each class and cross-tabulate them against the labels in the gold reference set (i.e. the “absolute truth”) provided by the task creators.

In these tables, the elements along the main diagonal indicate the correctly predicted labels and the off-diagonal elements show where the classifier went astray.
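A cross-tabulation of this kind takes only a few lines to build. The sketch below is an illustration using the standard library; the orientation (gold labels as rows, predictions as columns) and the toy labels are assumptions and may not match the tables in this post.

```python
from collections import Counter

def cross_tabulate(gold, pred, labels):
    """Return a square table: rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))           # missing (gold, pred) pairs count as 0
    return [[counts[(g, p)] for p in labels] for g in labels]

labels = ["AGAINST", "FAVOUR", "NEITHER"]
# Toy labels, invented for illustration only.
gold = ["FAVOUR", "FAVOUR", "AGAINST", "AGAINST", "NEITHER"]
pred = ["FAVOUR", "AGAINST", "AGAINST", "AGAINST", "NEITHER"]

table = cross_tabulate(gold, pred, labels)
for label, row in zip(labels, table):
    print(f"{label:8s}", row)

# Correct predictions sit on the main diagonal.
correct = sum(table[i][i] for i in range(len(labels)))
print("correct:", correct, "of", len(gold))
```

Each off-diagonal cell shows which class a given gold label was mistaken for, which is how the tables above are read.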

It is clear that both methods predict most classes reasonably well, with ULMFiT making more overall correct predictions in the “AGAINST” class, and the transformer making more correct predictions in the “FAVOUR” and “NEITHER” classes.

We can look at the training data distribution to see if we can draw additional inferences.

In the training data, there are far more Tweets belonging to the “AGAINST” class (image below).

The transformer seems to be able to achieve a broader overall context across even the minority classes, even when there are very few training samples per topic!

Distribution of classes (overall) from original training data

Effect of Augmentation Vocabulary

ULMFiT produced relatively poor overall average F-scores (of less than 0.55) when we fine-tuned the language model on just the original training set (of 2,914 Tweets).

Only when we augmented the language model with 200,000 Tweets from the Kaggle Sentiment140 dataset did ULMFiT produce overall F-scores that were comparable to MITRE’s benchmark results.

The language model fine-tuning of the LSTMs in ULMFiT does seem to require more data to generalize in this case than the transformer does.

The transformer did not require an augmentation vocabulary during the fine-tuning step.

The fact that it is able to achieve good generalization across even the minority classes seems to show that it achieves a better general understanding of Twitter-syntax in the presence of limited training samples.

All this seems very interesting and worthy of further study over more diverse datasets.

Effect of Sequence Length

It is well known that ULMFiT produces state-of-the-art accuracy on various text classification benchmark datasets such as IMDb and AG News.

The common theme across all these benchmark datasets is that they have really long sequences (some of the reviews/news articles are hundreds of words long), so clearly, the language model used by ULMFiT fine-tunes really well for long sequences.

Tweets, on the other hand, have a hard limit of 140 characters, which makes them rather short sequences compared to a full movie review or news article.

As described in the ULMFiT paper, the classifier uses “concatenated pooling” to help identify context in long sequences when a document contains hundreds or thousands of words.

To avoid loss of information in case of really long sequences, the hidden state at the last time step is concatenated with both the max-pooled and mean-pooled representation of the hidden states, as shown below.
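The concatenation described above, written in the ULMFiT paper as h_c = [h_T, maxpool(H), meanpool(H)], can be sketched in plain Python as follows. The function name `concat_pooling` and the tiny hidden-state matrix are illustrative; the real implementation operates on batched tensors.

```python
def concat_pooling(hidden_states):
    """Concatenate the last hidden state with max- and mean-pooled states.

    hidden_states: list of T vectors (lists of floats), one per time step.
    Returns one vector that is three times the hidden size.
    """
    last = hidden_states[-1]
    per_dim = list(zip(*hidden_states))         # transpose: one tuple per hidden unit
    max_pool = [max(values) for values in per_dim]
    mean_pool = [sum(values) / len(values) for values in per_dim]
    return list(last) + max_pool + mean_pool

# Two time steps, hidden size 2 (values invented for illustration).
H = [[1.0, 4.0],
     [3.0, 2.0]]
print(concat_pooling(H))
```

The pooled parts summarize every time step, so information from early in a long document can still reach the classifier even if the last hidden state has mostly forgotten it.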

Concatenated pooling: H is the vector of all hidden states

In this study, no changes were made to the model architecture for either method. However, it could be that representing the hidden state using just one of the two pooled representations (max or mean) would help ULMFiT generalize better to shorter sequences such as Tweets.

The transformer in its original form seems to have no trouble generalizing to Tweet syntax even though it was pre-trained on a book corpus that also had possibly long sequences.

It could be that the transformer’s self-attention mechanisms and high-dimensional self-attention layers are capable of adapting to varying sequence lengths while learning aspects of Tweet syntax better than the LSTMs with concat pooling in the hidden layer.

These are profound concepts and could be studied in greater detail.

Effect of Language Model Tuning

In ULMFiT, language model fine-tuning is a necessary step before training the classifier.

Including an augmented Tweet vocabulary during the language model fine-tuning step seemed to give the model a better understanding of Tweet syntax, which seemed to improve its performance.

Varying the learning rates, momentum and dropouts had only a minor, practically negligible impact on the F-score during fine-tuning.

In the OpenAI transformer, the language model is fine-tuned simultaneously with the classifier through the auxiliary language modelling objective, whose contribution is controlled by a weighting parameter (lambda).

When the language model fine-tuning objective was turned off (i.e. lambda=0) in the case of the transformer, the averaged macro F-score became markedly worse (below 0.


We can thus reason that the pre-trained model’s weights on their own do not sufficiently capture the syntax of Tweets, and that language model fine-tuning does help our model generalize better to Twitter data.

Setting lambda to a very high value (5) or a very low value (0.1) also did not improve the F-scores in our experiments, so there could be something specific to our Twitter data that makes the lambda value of 0.5 an optimum one.

It would be interesting to see whether other values of the LM coefficient (lambda) are favourable when applied on a completely different dataset.
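To see what the lambda values discussed in this section do to the training signal, here is a minimal sketch of the weighted sum written as a loss to be minimized. The function name `combined_loss` and the two loss values are hypothetical; in the real training loop both terms come from the model's outputs on each minibatch.

```python
def combined_loss(clf_loss, lm_loss, lm_coef=0.5):
    """Classification loss plus the language modelling loss scaled by lm_coef (lambda)."""
    return clf_loss + lm_coef * lm_loss

# Hypothetical per-batch losses, chosen only to show the effect of lambda.
clf_loss, lm_loss = 1.0, 4.0
for lm_coef in (0.0, 0.1, 0.5, 5.0):
    total = combined_loss(clf_loss, lm_loss, lm_coef)
    lm_share = lm_coef * lm_loss / total        # fraction of the total due to the LM term
    print(f"lambda={lm_coef}: total={total:.2f}, LM share={lm_share:.0%}")
```

With lambda at 0 the language modelling term disappears entirely, while at 5 it dominates the classification term; this is one plausible reading of why neither extreme helped in our experiments.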

Effect of Training Data Size

ULMFiT was able to produce good classification performance on each individual topic (for which we had fewer than 500 samples to train on).

In fact, our best result using ULMFiT was obtained (albeit with some data augmentation) by training five distinct classifiers on a per-topic basis.

This makes sense because ULMFiT has been shown to produce excellent transfer learning performance on very limited training samples (as few as 100!).

Hence, in cases where we have a very small number of labelled training samples, ULMFiT would be a good option to attempt transfer learning, at least for classification tasks.

The transformer, on the other hand, requires at least a few thousand training samples to generalize well and avoid over-fitting.

The OpenAI paper shows good results for a range of tasks which have anywhere from 5,000 to 550,000 training samples.

The transformer has 768-dimensional states in its self-attention layers and 3,072-dimensional inner states in the feed-forward networks; hence when we have fewer than 1,000 training samples, the model with its high dimensionality seems to memorize the training data and massively over-fit (with validation accuracy dropping well below 70%).
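To put that dimensionality in perspective, a back-of-the-envelope parameter count for the transformer blocks is sketched below. It assumes a standard block layout (query/key/value and output projections, a two-layer feed-forward network, two layer norms, biases everywhere) and ignores the embedding matrices, so the exact figure for the released model may differ; the function name is made up for illustration.

```python
def transformer_block_params(d_model=768, d_ff=3072):
    """Approximate trainable parameters in one transformer block (weights plus biases)."""
    qkv = d_model * 3 * d_model + 3 * d_model   # query, key and value projections
    attn_out = d_model * d_model + d_model      # attention output projection
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)             # gain and bias for two layer norms
    return qkv + attn_out + ffn + layer_norms

per_block = transformer_block_params()
print(f"per block: {per_block:,}")
print(f"12 blocks: {12 * per_block:,}")
```

Under these assumptions the 12 blocks alone hold tens of millions of parameters, several orders of magnitude more than the few hundred labelled Tweets available for a single topic.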

Effect of pre-training language model and architecture

The two language models were pre-trained on different text corpora (WikiText-103 and BooksCorpus).

Also, ULMFiT used a 3-layer AWD-LSTM architecture whereas OpenAI used a transformer network.

This Tweet by Yoav Goldberg on the topic provides some food for thought.

Based on our Tweet stance classification results, the transformer appears to have an advantage when it comes to learning case-specific syntax with limited context from a relatively small number of training samples; this could have more to do with the transformer architecture than with the pre-trained language model used.

However, until a more consistent language model is available on both architectures, it is relatively hard to tell which of these had a bigger impact on the results.

Conclusions

In this project, we studied the techniques used in two powerful transfer learning approaches (ULMFiT and the OpenAI transformer) for a novel task from a different distribution (stance detection of Tweets).

We developed a training and classification pipeline for two separate PyTorch-based frameworks and compared the macro F-scores for stance evaluation with the best results for this task from MITRE in 2016.

Both methods achieved good performance and good generalization with minimal customization of the model, and fine-tuning was achievable in reasonable time (about an hour overall for either method on a single Tesla P100 GPU).

Both methods have also been shown to make transfer learning very easy and achievable with a relatively small learning curve and relatively few additional lines of code.

While we were able to obtain better F-scores with the OpenAI Transformer across nearly all topics in this case, it does not imply that the transformer is always a better tool for such classification tasks.

This could depend very much on the nature of the data and the language models used, as well as the model architecture.

There could be many better ways to fine-tune either model for classification, and better hyper-parameter selection (or model customization) could help further improve the results — it just wasn’t possible to try them all for this project.

It appears that the task-agnostic and highly parallelized architecture of the transformer allows it to easily achieve rapid generalization in just 2–3 epochs of training for this Tweet stance task; however, the transformer is prone to over-fitting, especially when we have fewer than a thousand training samples.

ULMFiT is definitely much better at achieving a good performance with very small datasets (of fewer than 500 training samples); however it seems to require augmenting the vocabulary of the training data during the language model fine-tuning stage (considering that this specific Tweet classification task contained data that was very different from the pre-trained language model).

Overall, this is a very exciting time to be studying deep learning in NLP, and the advent of powerful transfer learning techniques such as these will hopefully open up deep learning applications to a much wider group of practitioners in the near future!

Acknowledgements

This work was done as part of a final course project in which my colleagues Andrew and Abhishek contributed significantly to idea generation as well as data cleaning and experimentation code.

They and I would welcome any feedback/comments.

Also, if you liked this article, please connect with me on LinkedIn and Twitter!

Note on Hardware Used

All notebooks and code shown in the GitHub repo were run on a machine that had an NVIDIA GPU, and the entire training process took roughly 1 hour (for ULMFiT) and under an hour (for the transformer) on a P100 GPU. Running the same code on a pure CPU machine can take orders of magnitude longer!
