When Multi-Task Learning Meets BERT

Introduction to Multi-Task Deep Neural Networks for Natural Language Understanding

Edward Ma, Mar 2

BERT (Devlin et al., 2018) achieved state-of-the-art results on multiple NLP problems in 2018.

It leveraged the Transformer architecture to learn contextualized word embeddings, so that the same vectors carry useful meaning across different domain problems.

To extend BERT, Liu et al. proposed Multi-Task Deep Neural Networks (MT-DNN), which achieved state-of-the-art results on multiple NLP problems.

In MT-DNN, BERT builds the shared text representation, while the fine-tuning stage leverages multi-task learning.

This story discusses Multi-Task Deep Neural Networks for Natural Language Understanding (Liu et al., 2019) and covers the following:

- Multi-task Learning
- Data
- Architecture
- Experiment

Multi-task Learning

Multi-task learning is a form of transfer learning.

When learning multiple tasks, we do not need to learn everything from scratch; we can apply knowledge learned from other tasks to shorten the learning curve.

Photo by Edward Ma on Unsplash

Take skiing and snowboarding as an example: you do not need to spend a lot of time learning to snowboard if you have already mastered skiing, because the two sports share many skills and you only need to pick up the differences. Recently, a friend who had mastered snowboarding told me he needed only one month to master skiing.

Back in data science, researchers and scientists believe that transfer learning can be applied when learning text representations.

GenSen (Subramanian et al., 2018) demonstrated that multi-task learning improves sentence embeddings. Part of the text representation can be learned from different tasks, and the shared parameters receive back-propagated gradients from all of them, learning better weights.
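The idea of shared parameters learning from several tasks can be sketched with a toy example. This is a hypothetical two-task setup (all sizes, names, and the linear targets are made up for illustration, not the authors' code): a shared encoder feeds two task-specific heads, and gradient updates from both task losses flow into the shared weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder weights plus one head per task (hypothetical toy sizes).
W_shared = rng.normal(size=(8, 4)) * 0.5   # maps 8-dim input -> 4-dim representation
W_heads = {"a": rng.normal(size=(4, 1)), "b": rng.normal(size=(4, 1))}

def forward(x, W_head):
    """Shared representation followed by a task-specific head."""
    h = np.tanh(x @ W_shared)
    return h, h @ W_head

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

# One toy mini-batch, with a simple linear target per task.
x = rng.normal(size=(16, 8))
targets = {"a": 0.1 * x @ rng.normal(size=(8, 1)),
           "b": 0.1 * x @ rng.normal(size=(8, 1))}

start = {t: mse(forward(x, W_heads[t])[1], y) for t, y in targets.items()}

lr = 0.05
for step in range(200):
    for task, y in targets.items():           # alternate between the two tasks
        h, pred = forward(x, W_heads[task])
        err = (pred - y) / len(x)
        grad_head = h.T @ err                              # d(loss)/d(W_head)
        grad_h = err @ W_heads[task].T * (1 - h ** 2)      # back through tanh
        W_heads[task] -= lr * grad_head
        W_shared -= lr * (x.T @ grad_h)       # shared weights get gradients
                                              # from BOTH tasks' losses

end = {t: mse(forward(x, W_heads[t])[1], y) for t, y in targets.items()}
print(start, end)   # loss per task before vs. after training
```

Each task updates its own head, but the shared encoder is shaped by both losses, which is the sense in which the shared representation benefits from every task.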

Data

The input is a word sequence, which can be a single sentence or two sentences combined with a separator. As in BERT, the sentence(s) are tokenized and transformed into initial word embeddings, segment embeddings, and position embeddings. After that, multiple bidirectional Transformer layers learn the contextual word embeddings. The difference is that multi-task learning is used to learn the text representation, which is then applied to individual tasks in the fine-tuning stage.
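As a BERT-style illustration of how those three embeddings combine (toy vocabulary, dimensions, and token ids are all made up; the real lexicon encoder is much larger), the input representation is simply their element-wise sum per position:

```python
import numpy as np

rng = np.random.default_rng(42)

vocab_size, max_len, dim = 100, 16, 8   # toy sizes, not BERT's real ones

# Lookup tables: word, segment (sentence A vs. B), and position embeddings.
word_emb = rng.normal(size=(vocab_size, dim))
segment_emb = rng.normal(size=(2, dim))
position_emb = rng.normal(size=(max_len, dim))

# A toy token-id sequence: sentence A, a separator, then sentence B.
token_ids = [1, 7, 12, 2, 25, 31, 2]     # hypothetical ids
segment_ids = [0, 0, 0, 0, 1, 1, 1]      # 0 = first sentence, 1 = second

# Initial input representation: sum the three embeddings at each position.
x = np.stack([
    word_emb[t] + segment_emb[s] + position_emb[p]
    for p, (t, s) in enumerate(zip(token_ids, segment_ids))
])
print(x.shape)  # one dim-8 vector per token
```

These summed vectors are what the stack of bidirectional Transformer layers then turns into contextual embeddings.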

Architecture of MT-DNN

MT-DNN is trained in two stages. The first stage pre-trains the Lexicon Encoder and the Transformer Encoder; following BERT, both encoders are trained with masked language modeling and next-sentence prediction. The second stage is fine-tuning, using mini-batch stochastic gradient descent (SGD). Unlike single-task learning, MT-DNN computes the loss of a different task on each mini-batch and applies the corresponding update to the same shared model.
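That fine-tuning loop, as described in the paper, packs each task's data into mini-batches, merges and shuffles them, and then updates the model one task-batch at a time. A minimal sketch (the task names, toy datasets, and the logging stand-in for the real loss computation are all placeholders):

```python
import random

# Toy datasets for three hypothetical tasks; real examples would be
# tokenized sentences, not integers.
datasets = {
    "classification": list(range(10)),
    "similarity": list(range(8)),
    "ranking": list(range(6)),
}
batch_size = 4

def make_batches(examples, size):
    """Split one task's examples into mini-batches."""
    return [examples[i:i + size] for i in range(0, len(examples), size)]

# 1. Pack each dataset into mini-batches, tagged with their task.
all_batches = [
    (task, batch)
    for task, examples in datasets.items()
    for batch in make_batches(examples, batch_size)
]

# 2. Merge and shuffle: one epoch visits every batch in random order.
random.seed(0)
random.shuffle(all_batches)

# 3. For each batch, compute only that task's loss and update the model.
update_log = []
for task, batch in all_batches:
    # In MT-DNN this step would run the shared BERT encoder plus the
    # task-specific head and apply an SGD update; here we just record it.
    update_log.append(task)

print(update_log)
```

Because the batches are interleaved, the shared layers see gradient updates from every task within a single epoch, rather than being fine-tuned on one task at a time.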

Training Procedure of MT-DNN (Liu et al., 2019)

The loss differs across tasks. For classification tasks, cross-entropy loss is used. For text similarity tasks, mean squared error is used. For relevance ranking tasks, negative log-likelihood is used.

Loss equations for classification, regression, and ranking (Liu et al., 2019)
In the architecture figure below, the shared layers transform text into contextual embeddings via BERT. After the shared layers, the input goes through a different sub-flow to learn a representation per specific task. The task-specific layers are trained for specific problems such as single-sentence classification and pairwise text similarity.

Architecture of MT-DNN (Liu et al., 2019)

Experiment

MT-DNN is based on the PyTorch implementation of BERT, with the following hyperparameters:

- Optimizer: Adamax
- Learning rate: 5e-5
- Batch size: 32
- Maximum epochs: 5
- Dropout rate: 0.1

GLUE test set results (Liu et al., 2019)

SNLI and SciTail results (Liu et al., 2019)

Take Away

Even with a similar architecture (i.e. BERT), better text representations can be learned by training on multiple NLP problems.

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics. You can reach me on my Medium blog, LinkedIn, or GitHub.

Extension Reading

- Bidirectional Encoder Representations from Transformers (BERT)
- General Purpose Distributed Sentence Representation (GenSen)

Reference

Devlin J., Chang M., Lee K., Toutanova K., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Subramanian S., Trischler A., Bengio Y., Pal C. J., 2018. Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

Liu X., He P., Chen W., Gao J., 2019. Multi-Task Deep Neural Networks for Natural Language Understanding
