Data Augmentation for Natural Language Processing

A naive approach would be to use a lexicon such as WordNet, which has a fixed definition assigned to each word.

We could then use a binary vector (“one-hot” encoding) with a true or false entry for each word in the lexicon, and use the concatenation of these vectors as the input for a sentence.

The problem with this is that the size of the one-hot encoded vector becomes unmanageable for large text corpora, since the vector grows with the vocabulary size.
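To make the size problem concrete, here is a minimal sketch of one-hot encoding a sentence. The helper names (`build_vocab`, `one_hot_sentence`) and the tiny corpus are made up for illustration and are not part of the original implementation.

```python
def build_vocab(sentences):
    """Map each unique lower-cased token to an integer index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot_sentence(sentence, vocab):
    """Concatenate one one-hot vector per token into a single flat list."""
    encoded = []
    for token in sentence.lower().split():
        vector = [0] * len(vocab)
        vector[vocab[token]] = 1
        encoded.extend(vector)
    return encoded


# Hypothetical toy corpus with 8 unique words.
corpus = ["the cat sat", "the dog sat", "a bird flew away"]
vocab = build_vocab(corpus)
encoded = one_hot_sentence("the cat sat", vocab)

# Input length = number of tokens * vocabulary size.
print(len(vocab), len(encoded))
```

With a realistic vocabulary of several hundred thousand words, the same three-word sentence would already need an input of around a million entries, almost all of them zero.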

Therefore, we want a fixed-size representation of words or sentences such that the input for the downstream task remains manageable, even for large text corpora.

The most popular method for this is Word2Vec, which uses the idea of context dependency.

Instead of assigning a fixed meaning to each word (as a lexicon would), we “characterise a word by the company it keeps”.

In other words, each expression in a text corpus is assigned a fixed-sized vector which represents its meaning relative to the other words in a text corpus.

One can achieve this by sliding a “window” over each word in a text corpus.

At each step, we look at the current “center word” and try to predict its context words.

This boils down to an optimisation task in which we maximise the probability of the context words appearing given the current center word.

Illustration of the training process to learn word vectors [1]

At each timestep t of a given word sequence, the first log term of the equation maximises the probability of two words co-occurring. We additionally take j negative samples, so that the probability of the real outside word appearing is maximised while the probability of random words appearing around the center word is minimised (full detail of this method can be found here).

Based on this setup, a shallow two-layer neural network adjusts the word vectors θ using stochastic gradient descent and backpropagation for the gradient update.
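The equation itself is not reproduced in this text. As a rough sketch, and assuming the usual skip-gram negative-sampling form (the log-sigmoid of the dot product between center and true context vector, plus the log-sigmoid of the negated dot product for each negative sample), the objective for a single word pair could be computed as follows. The vectors and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_objective(center, context, negatives):
    """Objective for one (center, context) pair; higher is better.

    log sigmoid(context . center) rewards the true pair, and each
    log sigmoid(-negative . center) penalises a randomly drawn word
    whose vector points in the same direction as the center vector.
    """
    score = math.log(sigmoid(dot(context, center)))
    for negative in negatives:
        score += math.log(sigmoid(-dot(negative, center)))
    return score


# Hypothetical 2-dimensional vectors, for illustration only.
center = [0.5, 1.0]
aligned_context = [0.4, 0.9]     # points the same way as the center vector
opposed_context = [-0.4, -0.9]   # points the opposite way
negatives = [[-0.3, -0.2], [0.1, -0.6]]

good = negative_sampling_objective(center, aligned_context, negatives)
bad = negative_sampling_objective(center, opposed_context, negatives)
print(good > bad)
```

Gradient-based training nudges the vectors so that this value increases for word pairs that really co-occur in the corpus.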

By iterating over large text corpora this way, words with similar meaning (i.e. those that appear in similar contexts) end up having a similar vector.

It is common to learn these vector representations on a large text corpus that may be unrelated to the training task, such as the 3bn-word Google News corpus, CommonCrawl, or the Wikipedia corpus.

After learning the word vectors, we can assess the semantic similarity between two words by looking at the cosine distance of their vectors.
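A minimal sketch of that comparison, using made-up 3-dimensional vectors (real embeddings typically have hundreds of dimensions). Cosine distance is simply one minus the cosine similarity computed here.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration.
vectors = {
    "power": [0.9, 0.8, 0.1],
    "strength": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(vectors["power"], vectors["strength"])
sim_far = cosine_similarity(vectors["power"], vectors["banana"])
print(round(sim_close, 3), round(sim_far, 3))
```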

PCA-reduced word vector representations trained on the CommonCrawl text corpus

Using this technique, we have a way to characterise each word in a vocabulary relative to the others, which allows the downstream learning task to abstract away “clusters” of words that have a similar meaning, as shown above.

Note that there are a number of adjustments to this simple Word2Vec model, such as Global Vectors (GloVe), FastText, or Doc2Vec, all of which are unsupervised learning methods for fixed-size vector representations.

Sentence bootstrapping with distributional embeddings

Having built an intuition for the understanding of language that most NLP models use, we can think of ways to use this understanding to generate new text samples.

One way to generate text data is to adjust the existing data to take the form of a semantically similar sentence.

The primary objective of this method is to change the numerical input to the downstream learning task while keeping the same semantics of the sentence.

Taking the example of hate speech classification again, we would want a racist comment (undersampled class) to keep being a racist comment, just using different words.

To generate these slightly altered samples, we can check two things for each word in a vocabulary:

- A cosine similarity threshold against all the other word vectors in the vocabulary
- A part-of-speech (POS) tag

Here, the cosine similarity is simply the dot product of two given word vectors, divided by the product of their magnitudes.

Two vectors with a higher cosine similarity are more similar and hence more interchangeable with each other.

For example, the words “power” and “strength” from the figure above are likely to be used in similar contexts and in turn have a higher cosine similarity.

A part-of-speech tag is a grammatical tag that we can assign to each word to check whether two words are grammatically the same.

In the example below, we can see that “letting” exceeds the similarity threshold but is a gerund, while “let” is a regular verb.

Checking for POS-tag equality simply ensures that we don’t make the sentence too grammatically incorrect when augmenting the samples.

Checking for cosine similarity threshold and part-of-speech tags

In the simple example above, we can see how the sample sentence is augmented into additional samples.
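The two checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy vectors, the POS tags, and the 0.95 threshold are invented here, whereas in practice the vectors would come from a trained embedding model and the tags from a POS tagger.

```python
import math

# Hypothetical toy embeddings and POS tags, chosen by hand for illustration.
VECTORS = {
    "let": [0.9, 0.1, 0.2],
    "allow": [0.85, 0.2, 0.25],
    "letting": [0.88, 0.12, 0.3],
    "them": [0.1, 0.9, 0.1],
    "go": [0.2, 0.3, 0.9],
    "leave": [0.25, 0.35, 0.85],
}
POS_TAGS = {
    "let": "VB", "allow": "VB", "letting": "VBG",
    "them": "PRP", "go": "VB", "leave": "VB",
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def replacements(word, threshold):
    """Words that are similar enough AND share the POS tag of `word`."""
    if word not in VECTORS:
        return []
    return [
        other for other in VECTORS
        if other != word
        and POS_TAGS[other] == POS_TAGS[word]
        and cosine_similarity(VECTORS[word], VECTORS[other]) >= threshold
    ]


def augment(sentence, threshold=0.95):
    """Create one new sentence per valid single-word replacement."""
    tokens = sentence.split()
    augmented = []
    for i, token in enumerate(tokens):
        for candidate in replacements(token, threshold):
            augmented.append(" ".join(tokens[:i] + [candidate] + tokens[i + 1:]))
    return augmented


print(augment("let them go"))
```

In this toy setup “letting” is close enough to “let” to pass the similarity check, but it is rejected because its tag (VBG) differs from that of “let” (VB).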

How could this help our sample model to classify hate speech? Assume the model is trained on multiple very similar abusive samples, where the key abusive word is frequently exchanged with a similar abusive word from the text corpus.

The downstream classifier will then put less emphasis on the individual hate words and more emphasis on the context words that may not have been exchanged.

This way, we push the model towards learning speech patterns as opposed to individual words.

This is useful for this learning task specifically, as the model is less likely to be led astray by common permutations that users may use to circumvent lexical hate speech checks, or by people using abusive language without actually saying something abusive (e.g. quoting someone saying something racist).

The threshold in this method is treated as an additional hyperparameter to be optimised.

The intuition behind optimising the cosine similarity threshold is that two words must have been seen in sufficiently similar contexts for one to replace the other without confusing the downstream task.

I tried this method on a number of hate speech datasets and achieved a 4–6% accuracy improvement and, most importantly, up to 25% better recall on the undersampled class.

Generative models for text augmentation

An alternative approach to upsampling the minority class is to generate new samples from scratch.

You may have read about Microsoft’s chat bot that turned racist by accident? How about purposely creating the most racist and misogynistic chat bot possible to generate new hate speech samples that we do not need to label anymore?

To do this, we could train a Markovian or RNN language model on the training data for each class, then pick a random start word and let the model predict the next word in the sequence.

For simplicity, let’s start off by looking at Markovian language models.

Recall that a Markov chain is based on the concept that the next element in a sequence depends only on the current element, and is drawn according to the transition probabilities from that element to each possible next element.

If we apply this to text, we could map each unique word in a corpus to a state and define the transition probabilities to be the probability of one word being the adjacent word to another one.

To illustrate this, let’s take the lyrics of “Imagine” by John Lennon as an example:

Imagine there’s no heaven,
It’s easy if you try,
No hell below us,
Above us only sky,
Imagine all the people,
Living for today…

If we map each word in these lyrics to a state, we will see that there are some words such as “Imagine”, “no”, or “us” that co-occur with more than one word.

If we now start a random walk through the chain of states illustrated below, we could come up with a new sample.

Illustration of a Markov-based language model for text generation

Of course, this example is very simple; Markovian language models work a lot better when applied to larger text corpora, where the number of word co-occurrences is higher.
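As a rough sketch of such a random walk, the following builds a first-order chain from the lyrics (lower-cased and with punctuation removed, which is a simplification made here) and samples a new sequence. The function names are illustrative and not taken from the published implementation.

```python
import random
from collections import defaultdict

LYRICS = (
    "imagine there's no heaven it's easy if you try "
    "no hell below us above us only sky "
    "imagine all the people living for today"
)


def build_chain(text):
    """Map each word to the list of words that directly follow it.

    Repeated successors stay in the list, so sampling uniformly from it
    follows the empirical transition probabilities.
    """
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words, rng):
    """Random walk: each next word depends only on the current word."""
    word, output = start, [start]
    while len(output) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


chain = build_chain(LYRICS)
print(sorted(chain["imagine"]))  # the two possible successors of "imagine"
print(generate(chain, "imagine", 8, random.Random(0)))
```

Words with more than one successor, such as “imagine”, “no”, and “us”, are the points where the walk can branch away from the original lyrics.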

The problem with using Markov Chains is that they are memory-less models, meaning that they only take the current state (word) at each timestep into account and ignore the entire previous sequence.

This leads to sentences that are grammatically incorrect in many cases.

An improvement to this method would be to train a Recurrent Neural Network, which takes the current word as well as the entire previous sequence into account.

To do so, we could use a Long Short-Term Memory (LSTM) unit, which allows the network to put either a higher or lower emphasis on more recent timesteps, and thus to adjust for longer or shorter text sequences to be generated.

To implement this, one must train a separate RNN model for each class. We used a W2V embedding as the embedding layer, whose output is passed into two subsequent 128-dimensional LSTM recurrent layers.

The separate outputs from the embedding layer and each LSTM layer are concatenated, yielding an N x 356 feature matrix (N being the number of words).

This feature matrix is passed into a final fully-connected layer, whose output is mapped to probabilities for each word in the corpus; the word with the highest probability becomes the next word in the generated sequence (see implementation on GitHub).
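The shapes involved can be checked with a small sketch. This is not the model itself: random numbers stand in for the outputs of the embedding and LSTM layers, the vocabulary is a hypothetical five-word list, and the 100-dimensional embedding is an assumption inferred from 356 minus two times 128.

```python
import math
import random

EMBEDDING_DIM = 100   # assumption: 356 - 2 * 128
LSTM_DIM = 128
VOCAB = ["<unk>", "imagine", "all", "the", "people"]  # hypothetical vocabulary

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


n_words = 4
# Random stand-ins for the per-word outputs of the three layers.
embedded = random_matrix(n_words, EMBEDDING_DIM)
lstm_1 = random_matrix(n_words, LSTM_DIM)
lstm_2 = random_matrix(n_words, LSTM_DIM)

# Concatenate the three outputs per word: N x (100 + 128 + 128) = N x 356.
features = [e + h1 + h2 for e, h1, h2 in zip(embedded, lstm_1, lstm_2)]

# Final fully-connected layer followed by a softmax over the vocabulary.
weights = random_matrix(len(features[0]), len(VOCAB))
last_step = features[-1]
logits = [
    sum(f * row[j] for f, row in zip(last_step, weights))
    for j in range(len(VOCAB))
]
probabilities = softmax(logits)
next_word = VOCAB[probabilities.index(max(probabilities))]
print(len(features), len(features[0]), next_word)
```

In a real model the three feature blocks would be produced by trained layers in a deep learning framework; only the concatenation and the final softmax step are shown faithfully here.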

After training this on a range of hate speech datasets (this one and this one), the text generator produced samples similar to those shown below:

Generated tweets from LSTM natural language generation model

Note that the generated samples do not always make sense to humans (such as the second generated hate speech sample).

However, this does not always matter for the downstream task, since the downstream model merely becomes able to recognise more combinations of language patterns and vocabulary from the training corpus.

Trying this technique on a number of hate speech datasets, I found this to further improve accuracy by 2–3% and hate speech class recall by 8–10%.

NLP and Chinese Whispers

One last remark regarding the design of NLP systems should be made for larger systems where multiple embeddings (i.e. mappings from words to numbers, such as Word2Vec explained above) are used.

In this situation, performance is maximised if all the embedding techniques in the system are aligned.

For example, assume we were to design a model that trains the sentence bootstrapping augmentation on the GoogleNews corpus, the natural language generation model using an alternative embedding technique (e.g. GloVe), and the downstream embedding using a different text corpus on the same technique.

Perhaps unsurprisingly, this model is likely to perform worse than a model that uses the same embedding and text corpora throughout, because each component of the model has a slightly different understanding of language.

Intuitively, we could relate this to a game of Chinese whispers — if player A (augmentation) has a different understanding of a word compared to player B (downstream embedding), the final output at the end of the game is more likely to be confused.

Therefore, we want all components (or players) in this multi-step model to have the same understanding of the words.

After aligning the word representation techniques and training corpora of the embeddings, the accuracy of the model usually increased by another 1–2% across datasets.

Final remarks & Implementation

In this post we showcased two relatively simple, but tough-to-beat data augmentation techniques for text.

There are three things that one should bear in mind when using these techniques.

Firstly, even when the generated samples don’t make perfect sense to humans, this may not matter too much for the model.

As long as the model can abstract away another language pattern or a word used in a similar context, its generalisation on unseen data is likely to be improved.

Secondly, these techniques are by no means perfect, and there are a lot of ways we could optimise them with regard to the distance metrics or neural network architectures used.

For example, an extension would be to use Generative Adversarial Networks (GANs) for the text generator.

Finally, bear in mind that the test data must not be augmented using these techniques, as it would strongly skew the results.
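One simple way to respect this is to hold out the test set before any augmentation takes place. The sketch below assumes a generic `augment` callable that returns a list of new samples for a given sample; `toy_augment` is a hypothetical stand-in and not one of the techniques described above.

```python
import random


def split_then_augment(samples, augment, test_fraction=0.2, seed=0):
    """Hold out the test set first, then augment the training set only."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    augmented_train = train + [new for s in train for new in augment(s)]
    return augmented_train, test


def toy_augment(sentence):
    # Hypothetical stand-in for a real augmentation function.
    return [sentence + " (augmented)"]


data = [f"sample {i}" for i in range(10)]
train, test = split_then_augment(data, toy_augment)
print(len(train), len(test))
```

Because the split happens first, no augmented variant of a test sample can leak into the training data.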

All of the methods mentioned above are implemented and published on GitHub under the Apache 2.0 License.

If you are interested in finding out more about this topic, find the full workings here.

[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
