Transfer Learning Intuition for Text Classification

xxmaj should i wait for the xxup comedk result or am i supposed to apply before the result ?,xxbos xxmaj what is it really like to be a nurse practitioner ?,xxbos xxmaj who are entrepreneurs ?,xxbos xxmaj is education really making good people nowadays ?
y: LMLabelList
,,,,
Path: .;

Test: None

The tokenized, prepared data is based on a lot of research from the FastAI developers.

To make this post a little more complete, here are the definitions of some of the special tokens as well.

xxunk is for an unknown word (one that isn’t present in the current vocabulary)
xxpad is the token used for padding, if we need to regroup several texts of different lengths in a batch
xxbos represents the beginning of a text in your dataset
xxmaj is used to indicate the next word begins with a capital in the original text
xxup is used to indicate the next word is written in all caps in the original text

b) Finetune Base Language Model on Task Specific Data

This task is also pretty easy when we look at the code.

The essence lies in the specific details of how we do the training.

The paper introduced two general concepts for this learning stage:

Discriminative fine-tuning:

The main idea is: as different layers capture different types of information, they should be fine-tuned to different extents.

Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates.

In the paper, the authors suggest first choosing the learning rate of the last layer by fine-tuning only that layer, and then using a learning rate lowered by a factor of 2.6 for each successive lower layer.
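To make the per-layer learning rates concrete, here is a minimal sketch of the rule from the ULMFiT paper, where each lower layer gets the learning rate of the layer above it divided by 2.6. The function name `discriminative_lrs`, the four-layer setup, and the last-layer rate of 0.01 are assumptions chosen only for illustration; they are not taken from the kernel.

```python
def discriminative_lrs(last_layer_lr, n_layers, factor=2.6):
    """Return one learning rate per layer, ordered from the first (lowest) layer to the last.

    Hypothetical helper: the last layer uses last_layer_lr, and each layer
    below it uses the rate of the layer above divided by `factor`.
    """
    return [last_layer_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]


# Illustrative values only (4 layers, last-layer learning rate of 0.01).
lrs = discriminative_lrs(last_layer_lr=0.01, n_layers=4)
for depth, lr in enumerate(lrs, start=1):
    print(f"layer {depth}: lr = {lr:.6f}")
```

The lowest layers, which hold the most general language knowledge, end up with the smallest updates.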


Slanted triangular learning rates:

According to the authors: “For adapting its parameters to task-specific features, we would like the model to quickly converge to a suitable region of the parameter space in the beginning of training and then refine its parameters.”

The main idea is to increase the learning rate quickly at the start of training, so that the model learns fast, and then to decay it slowly so that the later updates only fine-tune the parameters.
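The schedule can be sketched in a few lines of plain Python. This follows the formula given in the ULMFiT paper, with its suggested defaults (cut_frac = 0.1, ratio = 32, maximum learning rate 0.01); the function name and the choice of 1000 total training steps are assumptions for illustration, and this is not the code fastai runs internally.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t (0-based) under a slanted triangular schedule.

    Assumes total_steps * cut_frac >= 1 so that the warm-up phase is non-empty.
    """
    cut = math.floor(total_steps * cut_frac)  # step at which the rate peaks
    if t < cut:
        p = t / cut  # short linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


total_steps = 1000  # illustrative number of training iterations
schedule = [slanted_triangular_lr(t, total_steps) for t in range(total_steps)]
peak_step = max(range(total_steps), key=schedule.__getitem__)
print(f"start: {schedule[0]:.6f}, peak at step {peak_step}: {schedule[peak_step]:.6f}, end: {schedule[-1]:.6f}")
```

The rate starts at lr_max / ratio, climbs to lr_max over the first 10% of the steps, and then falls back towards lr_max / ratio over the remaining 90%.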

After training our Language model on the Quora dataset, we should be able to see how our model performs on the Language Model task itself.

The FastAI library provides us with a simple function to do that.

# check how the language model performs
learn.predict("What should", n_words=10)
---------------------------------------------------------------
'What should be the likelihood of a tourist visiting Mumbai for'

c) Finetune Base Language Model Layers + Task Specific Layers on Task Specific Data

This is the stage where task-specific learning takes place, that is, we add the classification layers and fine-tune them to perform our current task of text classification.

The authors augment the pretrained language model with two additional linear blocks.

Each block uses batch normalization and dropout, with ReLU activations for the intermediate layer and a softmax activation that outputs a probability distribution over target classes at the last layer.

The parameters of these task-specific layers are the only ones that are learned from scratch.

Here also the authors have derived a few novel methods:

Concat Pooling:

The authors use not only the hidden state at the last time step but also the max-pooled and mean-pooled representations of all hidden states, concatenated together, as input to the linear layers.
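As a rough illustration of concat pooling, the NumPy sketch below takes the hidden states of one document and builds the vector that would be fed to the linear layers: the last hidden state, the element-wise maximum over time, and the element-wise mean over time. The function name `concat_pool` and the tiny 3-step, 2-dimensional input are assumptions made for the example; a real model would do this per batch on much larger tensors.

```python
import numpy as np


def concat_pool(hidden_states):
    """Concatenate last state, max-pool and mean-pool of hidden states.

    hidden_states: array of shape (seq_len, hidden_dim) for a single document.
    Returns a vector of length 3 * hidden_dim.
    """
    last = hidden_states[-1]                  # hidden state at the final time step
    max_pooled = hidden_states.max(axis=0)    # element-wise max over time steps
    mean_pooled = hidden_states.mean(axis=0)  # element-wise mean over time steps
    return np.concatenate([last, max_pooled, mean_pooled])


# Toy example: 3 time steps, hidden size 2 (made-up numbers).
H = np.array([[1.0, -2.0],
              [3.0, 0.0],
              [2.0, 4.0]])
pooled = concat_pool(H)
print(pooled.shape)  # (6,)
```

Pooling over all time steps helps because the signal relevant to the label may sit anywhere in a long text, not only near the end.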

Gradual Unfreezing:

Rather than fine-tuning all layers at once, which risks catastrophic forgetting (forgetting everything we have learned so far from language models), the authors propose to gradually unfreeze the model starting from the last layer, as this contains the least general knowledge.

The authors first unfreeze the last layer and fine-tune all unfrozen layers for one epoch.

They then unfreeze the next lower frozen layer and repeat, until all layers are unfrozen and are fine-tuned until convergence at the last iteration.
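The unfreezing order described above can be sketched without any deep learning library. The generator below only computes which layer groups are trainable at each stage; the function name and the five group names are hypothetical and stand in for whatever groups the real model has.

```python
def gradual_unfreezing_schedule(layer_groups):
    """Yield, stage by stage, the layer groups that are trainable.

    Stage 1 trains only the last group; each later stage also unfreezes
    the next lower group, until every group is trainable.
    """
    for n_unfrozen in range(1, len(layer_groups) + 1):
        yield layer_groups[-n_unfrozen:]


# Hypothetical layer groups, ordered from lowest to highest.
groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier_head"]
stages = list(gradual_unfreezing_schedule(groups))
for stage, trainable in enumerate(stages, start=1):
    print(f"stage {stage}: train {trainable}")
```

In fastai v1 this pattern is usually written with calls such as learn.freeze_to(-2) followed later by learn.unfreeze(), with a round of training after each call; the exact sequence used for this post is in the kernel.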

Passing slice(2e-3/100, 2e-3) as the learning rate means that the layer groups are trained with different learning rates, ranging from 2e-3/100 for the earliest layers up to 2e-3 for the last ones.
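To see what such a range could look like, here is a small sketch that spreads learning rates between the two ends of the slice. It assumes the intermediate groups are spaced by a constant multiplicative step, which is my understanding of how fastai v1 fills in the values between the start and stop of a slice; the helper name and the choice of five layer groups are assumptions for illustration.

```python
def spread_learning_rates(lr_min, lr_max, n_groups):
    """Return n_groups learning rates from lr_min to lr_max with a constant ratio between neighbours.

    Hypothetical helper; geometric spacing is an assumption about the library's behaviour.
    """
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]


# Same endpoints as slice(2e-3/100, 2e-3); five groups is an assumed count.
lrs = spread_learning_rates(2e-3 / 100, 2e-3, n_groups=5)
for group, lr in enumerate(lrs, start=1):
    print(f"group {group}: lr = {lr:.2e}")
```

The first group moves a hundred times more slowly than the last one, which matches the idea of discriminative fine-tuning.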

One can get the predictions for the test data at once using:

test_preds = np. […] Test, ordered=True)[0])[:,1]

I am a big fan of Kaggle Kernels.

One could not have imagined having all that compute for free.

You can find a running version of the above code in this Kaggle kernel.

Do try to experiment with it after forking and running the code.

Also please upvote the kernel if you find it helpful.

Results:

Here are the final results of all the different approaches I have tried on the Kaggle dataset. I ran a 5-fold stratified CV.

a. Conventional Methods:

b. Deep Learning Methods:

c. Transfer Learning Methods (ULMFiT):

The results achieved were not very good compared to deep learning methods, but I still liked the idea of the transfer learning approach, and it was so easy to implement using fastAI.

Running the code also took a lot of time, 9 hours, compared to the other methods, which finished in 2 hours.

Even if this approach didn’t work well for this dataset, it is a valid approach for other datasets, as the authors of the paper have achieved pretty good results on different datasets, so it is definitely a genuine method to try out.

PS: Note that I didn’t work on tuning the above models, so these results are only cursory.

You can try to squeeze out more performance by tuning the hyperparameters with hyperopt or just old-fashioned grid search.

Conclusion:

Finally, this post concludes my NLP Learning series.

It took a lot of time to write, but the effort was well worth it.

I hope you will find it helpful in your work.

I will try to write some more on this topic when I get some time.

Follow me on Medium or subscribe to my blog to be informed about my next posts.


Let me know if you think I can add something more to the post; I will try to incorporate it.

Cheers!!!

Originally published at mlwhiz.com on March 30, 2019.
