Data Augmentation library for text
Edward Ma · Apr 20

In the previous story, you saw different approaches to generating more training data for your NLP task model. In this story, we will learn how you can do it with just a few lines of code.
In the natural language processing (NLP) field, it is hard to augment text due to the high complexity of language. Not every word can be replaced by another (for example "a", "an", "the"), and not every word has a synonym. Even changing a single word can make the context totally different. On the other hand, generating augmented images in the computer vision field is relatively easy: even after introducing noise or cropping out a portion of the image, a model can usually still classify it.
Introduction to nlpaug
After using imgaug in a computer vision project, I wondered whether we could have a similar library to generate synthetic text data. Therefore, I re-implemented those research papers by using existing libraries and pre-trained models. The basic elements of nlpaug include:
Character: OCR Augmenter, QWERTY Augmenter and Random Character Augmenter
Word: WordNet Augmenter, word2vec Augmenter, GloVe Augmenter, fasttext Augmenter, BERT Augmenter and Random Word Augmenter
Flow: Sequential Augmenter and Sometimes Augmenter
Intuitively, Character Augmenters and Word Augmenters focus on character-level and word-level manipulation respectively, while Flow works as an orchestrator that controls the augmentation pipeline. You can access the library on GitHub.
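To follow the examples below, you can install the library and set up the common aliases. This is a minimal sketch: the pip package name and module aliases follow the library's documentation, and the location of the Action constants is inferred from the flow examples later in this story.

# pip install nlpaug
import nlpaug.augmenter.char as nac   # character-level augmenters
import nlpaug.augmenter.word as naw   # word-level augmenters
import nlpaug.flow as naf             # pipelines such as Sequential and Sometimes
from nlpaug.util import Action        # action constants (INSERT, SUBSTITUTE, DELETE)

text = 'The quick brown fox jumps over the lazy dog'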
Character
Augmenting data at the character level. Possible scenarios include image-to-text and chatbots. When recognizing text from an image, we need an optical character recognition (OCR) model, but OCR introduces errors such as confusing "o" and "0". In chatbots, we still have typos even though most applications come with word correction. To overcome these problems, you may let your model "see" the possible erroneous outcomes before online prediction.
OCR
When working on an NLP problem, OCR output may be one of your inputs. For example, "0" may be recognized as "o" or "O". If you are using bag-of-words or classic word embeddings as features, you will be in trouble, as out-of-vocabulary (OOV) tokens will be around you today and always. If you use a state-of-the-art model such as BERT or GPT, the OOV issue seems resolved because words are split into subwords, but some information is still lost. OCRAug is designed to simulate OCR errors. It replaces target characters according to a pre-defined mapping table.
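A minimal sketch of invoking it, assuming the class is exposed as nac.OcrAug in the version you install:

import nlpaug.augmenter.char as nac

# substitute characters according to a pre-defined OCR confusion table (e.g. o <-> 0)
aug = nac.OcrAug()
print(aug.augment('The quick brown fox jumps over the lazy dog'))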
Example of augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick brown fox jumps over the lazy d0g

QWERTY
Another project you may be involved in is a chatbot or another messaging channel such as email. Although spell checking is performed, some misspellings still exist, and they may hurt your NLP model as mentioned before. QWERTYAug is designed to simulate keyboard-distance errors. It replaces target characters with characters one keyboard distance away, and you can configure whether or not to include numbers and special characters.
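A similar sketch; the class name QwertyAug mirrors the article's QWERTYAug, but newer releases expose it as KeyboardAug, so treat the exact name as version-dependent:

import nlpaug.augmenter.char as nac

# substitute characters with neighbours that are one key away on a QWERTY layout
aug = nac.QwertyAug()
print(aug.augment('The quick brown fox jumps over the lazy dog'))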
Example of augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: Tne 2uick hrown Gox jumpQ ovdr tNe <azy d8g

Random Character
According to several pieces of research, noise injection can sometimes help to generalize your NLP model. In text, we may add noise to words by, for example, adding or deleting one character from a word. RandomCharAug is designed to inject this kind of noise into your data. Unlike OCRAug and QWERTYAug, it supports insertion, substitution and deletion.
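A sketch of the insert action, reusing the Action constants from the setup above; the substitute and delete actions are assumed to work the same way:

import nlpaug.augmenter.char as nac
from nlpaug.util import Action

# randomly insert characters into words
aug = nac.RandomCharAug(action=Action.INSERT)
print(aug.augment('The quick brown fox jumps over the lazy dog'))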
Example of insert augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: T(he quicdk browTn Ffox jumpvs 7over kthe clazy 9dog

Word
Besides character augmentation, the word level is important as well. We make use of word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fasttext (Joulin et al., 2016), BERT (Devlin et al., 2018) and WordNet to insert and substitute similar words. Word2vecAug, GloVeAug and FasttextAug use word embeddings to find the group of words most similar to the original word and replace it with one of them. BertAug, on the other hand, uses a language model to predict possible target words. WordNetAug uses a statistical approach to find a group of similar words.
Word Embeddings (word2vec, GloVe, fasttext)
Classic embeddings use a static vector to represent a word. Ideally, the meanings of two words are similar if their vectors are near each other. In practice, though, it depends on the training data: for example, "rabbit" is similar to "fox" in word2vec, while "nbc" is similar to "fox" in GloVe.

Most similar words of "fox" among classic word embedding models

Sometimes you want to replace a word with a similar one so that the NLP model does not rely on a single word. Word2vecAug, GloVeAug and FasttextAug are designed to provide "similar" words based on pre-trained vectors. Besides substitution, insertion helps to inject noise into your data; the inserted word is picked from the vocabulary randomly.
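A sketch of embedding-based substitution. The model_path value is a placeholder for pre-trained word2vec vectors you download yourself, and note that newer releases consolidate these classes into a single WordEmbsAug:

import nlpaug.augmenter.word as naw

# replace words with their nearest neighbours in the embedding space;
# the path below is a placeholder for pre-trained word2vec binaries
aug = naw.Word2vecAug(model_path='GoogleNews-vectors-negative300.bin')
print(aug.augment('The quick brown fox jumps over the lazy dog'))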
Example of insert augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick Bergen-Belsen brown fox jumps over Tiko the lazy dog

Example of substitute augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick gray fox jumps over to lazy dog

Contextualized Word Embeddings
Since classic word embeddings use a static vector to represent the same word in every context, they may not fit some scenarios: "fox", for instance, can refer to an animal or to a broadcasting company. To overcome this problem, contextualized word embeddings were introduced; they consider the surrounding words to generate a different vector for each context. BertAug is designed to use this capability to perform insertion and substitution. Unlike the previous word-embedding augmenters, its insertion is predicted by the BERT language model rather than picking a word at random, and its substitution uses the surrounding words as features to predict the target word.
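A sketch of BERT-based substitution. The action parameter value is an assumption, newer releases rename the class to ContextualWordEmbsAug, and 'bert-base-uncased' is the standard pre-trained model identifier:

import nlpaug.augmenter.word as naw

# let a BERT language model predict a contextually plausible replacement;
# action='substitute' is an assumed parameter, check your installed version
aug = naw.BertAug(model_path='bert-base-uncased', action='substitute')
print(aug.augment('The quick brown fox jumps over the lazy dog'))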
Example of insert augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: the lazy quick brown fox always jumps over the lazy dog

Example of substitute augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: the quick thinking fox jumps over the lazy dog

Synonym
Besides the neural network approaches, a thesaurus can achieve a similar objective. The limitation of synonyms is that some words simply do not have one. WordNet, from the awesome NLTK library, helps to find the synonyms. WordNetAug provides a substitution feature to replace the target word (see the sketch after this list). Instead of purely looking up synonyms, some preliminary checks make sure that the target word can be replaced. Those rules are:
Do not pick a determiner (e.g. a, an, the).
Do not pick a word which does not have a synonym.
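A sketch of the thesaurus-based augmenter; it relies on NLTK's wordnet corpus being downloaded beforehand, and newer releases rename the class to SynonymAug:

import nltk
import nlpaug.augmenter.word as naw

nltk.download('wordnet')  # WordNetAug depends on the NLTK wordnet corpus

# substitute words with WordNet synonyms, skipping determiners and
# words that have no synonym
aug = naw.WordNetAug()
print(aug.augment('The quick brown fox jumps over the lazy dog'))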
Example of augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick brown fox parachute over the lazy blackguard

Random Word
So far we have not introduced deletion at the word level. RandomWordAug can help to remove words randomly.
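A sketch, assuming deletion is this augmenter's default action:

import nlpaug.augmenter.word as naw

# randomly drop words from the input text
aug = naw.RandomWordAug()
print(aug.augment('The quick brown fox jumps over the lazy dog'))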
Example of augmentation
Original: The quick brown fox jumps over the lazy dog
Augmented Text: The fox jumps over the lazy dog

Flow
Up to this point, all of the augmenters above are invoked alone. What if you want to combine multiple augmenters? To make use of multiple augmentations, the Sequential and Sometimes pipelines are introduced to connect augmenters. A single text can go through different augmenters to generate diverse data.

Sequential
You can add as many augmenters as you want to this flow, and Sequential executes them one by one. For example, you can combine RandomCharAug and RandomWordAug as shown below.
aug = naf.Sequential([
    nac.RandomCharAug(action=Action.INSERT),
    naw.RandomWordAug()
])
aug.augment(text)

Sometimes
If you do not want to execute the same set of augmenters every time, Sometimes picks only a subset of the augmenters on each invocation.
aug = naf.Sometimes([
    nac.RandomCharAug(action=Action.DELETE),
    nac.RandomCharAug(action=Action.INSERT),
    naw.RandomWordAug()
])
aug.augment(text)

Recommendation
The approaches above were designed to solve the problems that the papers' authors were facing in their own work.
If you understand your data, you should tailor the augmentation approach to it. Remember the golden rule in data science: garbage in, garbage out. In general, you can try the thesaurus approach without a deep understanding of your data, but it may not boost performance much due to the aforementioned limitation of the thesaurus approach.
About Me
I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or GitHub.
Extension Reading
Image augmentation library (imgaug)
Text augmentation library (nlpaug)

Reference
X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015.
W. Y. Wang and D. Yang. That's So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets. 2015.
S. Kobayashi. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. 2018.
C. Coulombe. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. 2018.