Learning More with Less

From a data perspective, that means we see conversational datasets that are highly contextual to a partner in the language they use and that vary dramatically in size — both situations that would benefit from what ULMFiT has demonstrated. For this research we focus on answering the following: if I have a small, fixed budget for labeled examples, how much unlabeled domain-specific data do I have to collect to make effective use of transfer learning?

We answered this with an experiment that pairs with fast.ai's like this: they used a large, fixed pool of domain data and varied the number of labeled examples to show how the model improved. We held the number of labeled examples constant and varied the amount of additional unlabeled domain examples.

More formally, our experiment consists of:

- Language Modeling (variant)
- Language Task (invariant)

Our language task, sentiment classification, is the same as the task in the original ULMFiT paper and uses the IMDB movie review dataset. We hold the number of labeled sentiment training examples to 500 across all experiments, a number we thought was attainable for many small domains and one that would help emphasize the differential lifting power of different language models.

[Figure: our research in the context of the results from the original ULMFiT paper above]

For language modeling, we vary the amount of domain data available to three alternative language models feeding into the language task:

- ULM Only: the WikiText-103 pre-trained English language model
- Domain Only: a domain-based language model trained only on IMDB data
- ULM + Domain: the ULMFiT model
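To make the setup concrete, here is a minimal sketch of this pipeline using the fastai v1 text API. This is our illustration, not the project's actual training code: it uses fastai's small bundled IMDB sample as stand-in data, and the epoch counts and learning rates are placeholder values rather than the hyperparameters from our grid.

```python
from fastai.text import *
import pandas as pd

# Stand-in data: fastai's small IMDB sample (the real experiment used the
# full IMDB dataset, with the size of the unlabeled pool swept across runs).
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')            # columns: label, text, is_valid
train_df, valid_df = df[~df.is_valid], df[df.is_valid]
labeled_df = train_df.sample(min(500, len(train_df)), random_state=1)

# Language modeling data: unlabeled domain text (labels are ignored here).
data_lm = TextLMDataBunch.from_df(path, train_df=train_df, valid_df=valid_df,
                                  text_cols='text')

# ULM Only and ULM + Domain start from the WikiText-103 pre-trained AWD-LSTM;
# Domain Only would pass pretrained=False to train from scratch on IMDB text.
lm_learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=True)

# ULM + Domain: fine-tune the language model on the unlabeled domain pool.
# (ULM Only skips this fine-tuning step entirely.)
lm_learn.fit_one_cycle(1, 1e-2)
lm_learn.unfreeze()
lm_learn.fit_one_cycle(1, 1e-3)
lm_learn.save_encoder('ft_enc')

# The invariant language task: sentiment classification on a fixed budget of
# 500 labeled examples, reusing the encoder and vocabulary from the LM.
data_clas = TextClasDataBunch.from_df(path, train_df=labeled_df, valid_df=valid_df,
                                      text_cols='text', label_cols='label',
                                      vocab=data_lm.train_ds.vocab)
clas_learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clas_learn.load_encoder('ft_enc')
clas_learn.fit_one_cycle(1, 1e-2)
```

Passing pretrained=False to language_model_learner, or skipping the fine-tuning epochs, yields the Domain Only and ULM Only variants respectively; the experiment grid then sweeps the size of the unlabeled pool while labeled_df stays fixed at 500 rows.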
Training these models is computationally intense. At the largest domain sizes, training can take a few days to complete on a typical work computer. To speed things up, and to efficiently execute our experiment's grid search in parallel, we used FloydHub. (Additionally, the folks over at FloydHub keep their machines on the cutting edge, having both PyTorch v1 and fastai v1 capable GPU machines available almost immediately after they were released.)

The Results

After about 50 hours' worth of intense GPU processing — but only 3 hours of wall-clock time, thanks to FloydHub — we have our results! What we see above tells a clear story:

- Combining broad language structures from the ULM with unlabeled domain text always leads to a major improvement, even when domain text is minimal.
- We can get 75% of the performance ULMFiT reported with 33% of the domain data.
- Amazingly, ULM + 2,000 domain examples reached nearly 85% language task prediction accuracy.

Making Machine Learning Work for Everyone

At Frame, we're obsessed with transfer learning because it enables a collaborative approach to AI: we think NLP should be a tool that our users can work with to explore and express a view about their own data — not just a way of transmitting received wisdom from some other corpus. When you start from that point of view, the question is less "how much data can I aggregate to inform a generic model" and more "what is the narrowest domain for which I can develop a useful, specialized model". How can we target specific use cases, and work from day zero, regardless of the amount of available data?

Our results confirm the value of the ULM + Domain approach — it allows you to gracefully improve on specialized tasks as more unlabeled domain data becomes available. Moreover, we've shown the improvement comes rapidly, generating many of the learning benefits demonstrated in the ULMFiT paper with a fraction of the unlabeled data. Mapping how passively collected domain data improves our models helps us make informed decisions about when to surface domain-specific models, such as our neural tags — important both for our costs and our guidance to customers. This is great news for any company that delivers NLU data products based on emerging or rapidly evolving language domains. Shortening the time-to-insight doesn't just mean we can target narrower domains.
