NLP Breakthrough: ImageNet Moment Has Arrived

(Source: Matthew Peters)

In light of this step change, it is very likely that in a year’s time NLP practitioners will download pretrained language models rather than pretrained word embeddings for use in their own models, much as pretrained ImageNet models are the starting point for most CV projects nowadays.

However, as with word2vec, the task of language modeling has its own limitations: it is only a proxy for true language understanding, and a single monolithic model is ill-equipped to capture the information required for certain downstream tasks. For instance, in order to answer questions about characters in a story or follow their trajectory, a model needs to learn to perform anaphora or coreference resolution. In addition, language models can only capture what they have seen. Certain types of information, such as most common sense knowledge, are difficult to learn from text alone and require incorporating external information.

One outstanding question is how to transfer the information from a pretrained language model to a downstream task. There are two main paradigms: use the pretrained language model as a fixed feature extractor and feed its representations as features into a randomly initialized model, as ELMo does, or fine-tune the entire language model, as ULMFiT does. The latter, fine-tuning approach is what is typically done in CV, where either the top-most layer or several of the top layers are fine-tuned. While NLP models are typically shallower and thus require different fine-tuning techniques than their vision counterparts, recent pretrained models are getting deeper. The next months will show the impact of each of the core components of transfer learning for NLP: an expressive language model encoder such as a deep BiLSTM or the Transformer, the amount and nature of the data used for pretraining, and the method used to fine-tune the pretrained model.

Our analysis thus far has been mostly conceptual and empirical, as it is still poorly understood why models trained on ImageNet (and consequently on language modeling) transfer so well. One way to think about the generalization behaviour of pretrained models more formally is under a model of bias learning (Baxter, 2000). Assume our problem domain covers all permutations of tasks in a particular discipline, e.g. computer vision, which forms our environment.
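The two transfer paradigms described above, a fixed feature extractor versus full fine-tuning, can be illustrated with a deliberately tiny sketch. This is not how ELMo or ULMFiT are implemented: the "encoder" here is a single scalar weight standing in for pretrained parameters, and the function name `train`, the data, and the hyperparameters are all hypothetical choices made for illustration. The only point is which parameters receive gradient updates in each paradigm.

```python
import random


def train(xs, ys, encoder_w, freeze_encoder, lr=0.01, epochs=200, seed=0):
    """Fit y ~ head_w * (encoder_w * x) with SGD on squared error.

    encoder_w is a stand-in for pretrained weights (hypothetical toy model).
    head_w is the randomly initialised, task-specific layer.
    """
    rng = random.Random(seed)
    head_w = rng.uniform(-0.1, 0.1)  # the new task layer starts from random init
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            feature = encoder_w * x            # representation from the encoder
            err = head_w * feature - y         # prediction error
            grad_head = 2 * err * feature      # d(loss)/d(head_w)
            grad_enc = 2 * err * head_w * x    # d(loss)/d(encoder_w)
            head_w -= lr * grad_head
            if not freeze_encoder:
                # Fine-tuning: the pretrained weights are updated as well.
                encoder_w -= lr * grad_enc
    return encoder_w, head_w


# Toy downstream task: y = 3x. The "pretrained" encoder weight is 1.5.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]

# Paradigm 1: fixed feature extractor (encoder frozen, only the head learns).
frozen_enc, frozen_head = train(xs, ys, encoder_w=1.5, freeze_encoder=True)

# Paradigm 2: fine-tuning (encoder and head are both updated).
tuned_enc, tuned_head = train(xs, ys, encoder_w=1.5, freeze_encoder=False)

print("feature extraction:", frozen_enc, frozen_head)
print("fine-tuning:       ", tuned_enc, tuned_head)
```

In the frozen case the encoder weight is returned unchanged and the head alone absorbs the task; in the fine-tuned case both weights move. Real systems add further machinery on top of this distinction, which the toy model omits.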
