Neural Machine Translation

A guide to Neural Machine Translation using an Encoder Decoder structure with attention.

Includes a detailed tutorial using PyTorch in Google Colaboratory.

Quinn Lanners · Jun 3 · Image from pixabay.com

Machine Translation (MT) is a subfield of computational linguistics that is focused on translating text from one language to another.

With the power of deep learning, Neural Machine Translation (NMT) has arisen as the most powerful algorithm to perform this task.

While Google Translate is the leading industry example of NMT, tech companies all over the globe are going all in on NMT.

This state-of-the-art algorithm is an application of deep learning in which massive datasets of translated sentences are used to train a model capable of translating between a given pair of languages.

With the vast amount of research in recent years, there are several variations of NMT currently being investigated and deployed in the industry.

One of the older and more established versions of NMT is the Encoder Decoder structure.

This architecture is composed of two recurrent neural networks (RNNs) used together in tandem to create a translation model.

And when coupled with the power of attention mechanisms, this architecture can achieve impressive results.

This post is broken into two distinct parts.

The first section consists of a brief explanation of NMT and the Encoder Decoder structure.

Following this, the latter part of this article provides a tutorial that walks you through creating one of these structures yourself.

This code tutorial is based largely on the PyTorch tutorial on NMT with a number of enhancements.

Most notably, this code tutorial can be run on a GPU to receive significantly better results.

Before we begin, it is assumed that if you are reading this article you have at least a general knowledge of neural networks and deep learning; particularly the ideas of forward-propagation, loss functions and back-propagation, and the importance of train and test sets.

If you are interested in jumping straight to the code, you can find the complete Jupyter notebook (or Python script) of the Google Colab tutorial outlined in this article on my GitHub page for this project.

Brief Explanation of NMT and the Encoder Decoder Structure

The ultimate goal of any NMT model is to take a sentence in one language as input and return that sentence translated into a different language as output.

The figure below is a naive representation of a translation algorithm (such as Google Translate) tasked with translating from English to Spanish.

Figure 1: Translation from English to Spanish of the English sentence “the cat likes to eat pizza”

Before diving into the Encoder Decoder structure that is oftentimes used as the algorithm in the above figure, we first must understand how we overcome a large hurdle in any machine translation task.

Namely, we need a way to transform sentences into a data format that can be inputted into a machine learning model.

In essence, we must somehow convert our textual data into a numeric form.

To do this in machine translation, each word is transformed into a One Hot Encoding vector which can then be inputted into the model.

A One Hot Encoding vector is simply a vector with a 0 at every index except for a 1 at a single index corresponding to that particular word.

In this way, each word has a distinct One Hot Encoding vector and thus we can represent every word in our dataset with a numerical representation.

The first step towards creating these vectors is to assign an index to each unique word in the input language, and then repeat this process for the output language.

In assigning a unique index to each unique word, we will be creating what is referred to as a Vocabulary for each language.

Ideally, the Vocabulary for each language would simply contain every unique word in that language.

However, given that any single language can have hundreds of thousands of words, the vocabulary is often trimmed to the N most common words in the dataset we are working with (where N is chosen arbitrarily, but often ranges from 1,000–100,000 depending on the dataset size).

To understand how we can then use a Vocabulary to create One Hot Encoding vectors for every word in our dataset, consider a mini-Vocabulary containing just the words in Table 1 below.

Table 1: Mini-vocabulary for the English language

Given this table, we have assigned a unique index 0–12 to every word in our mini-Vocabulary.

The <SOS> and <EOS> tokens in the table are added to every Vocabulary and stand for START OF SENTENCE and END OF SENTENCE respectively.

They are used by the NMT model to help identify these crucial points in sentences.

Now, let’s say we want to convert the words in the sentence “the blue whale ate the red fish” to their one hot encoding vectors.

Using Table 1, we would do this as shown in Figure 2 below.

Figure 2: One Hot Encoding vectors for the sentence “the blue whale ate the red fish”

As you can see above, each word becomes a vector of length 13 (which is the size of our vocabulary) and consists entirely of 0s except for a 1 at the index that was assigned to that word in Table 1.

By creating a vocabulary for both the input and output languages, we can perform this technique on every sentence in each language to completely transform any corpus of translated sentences into a format suitable for the task of machine translation.
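To make this concrete, here is a minimal sketch of how a vocabulary like Table 1 could be used to one-hot encode that sentence in PyTorch (the exact index assignments are illustrative, not necessarily those used in Table 1):

```python
import torch

# Mini-vocabulary in the spirit of Table 1 (word -> index); index assignments are illustrative
mini_vocab = {"<SOS>": 0, "<EOS>": 1, "the": 2, "cat": 3, "likes": 4, "to": 5, "eat": 6,
              "pizza": 7, "blue": 8, "whale": 9, "ate": 10, "red": 11, "fish": 12}

def one_hot(word, vocab):
    vec = torch.zeros(len(vocab))   # vector of length 13, all zeros
    vec[vocab[word]] = 1.0          # single 1 at the index assigned to this word
    return vec

sentence = "the blue whale ate the red fish"
vectors = [one_hot(word, mini_vocab) for word in sentence.split()]
print(vectors[0])  # the vector for "the": a 1 at index 2, 0s everywhere else
```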

Now, with an understanding of how we can represent textual data in a numeric way, let’s look at the magic behind this Encoder Decoder algorithm.

At the most basic level, the Encoder portion of the model takes a sentence in the input language and creates a thought vector from this sentence.

This thought vector stores the meaning of the sentence and is subsequently passed to a Decoder which outputs the translation of the sentence in the output language.

This process is shown in the figure below.

Figure 3: Encoder Decoder structure translating the English sentence “the cat likes to eat pizza” to the Spanish sentence “el gato le gusta comer pizza”

In the above architecture, the Encoder and the Decoder are both recurrent neural networks (RNNs).

In this particular tutorial, we will be using Long Short-Term Memory (LSTM) models, which are a type of RNN.

However, other RNN architectures, such as the GRU, are also often used.

At a basic level, RNNs are neural networks designed specifically to deal with temporal/textual data.

This article will give a high-level overview of how RNNs work in the context of NMT, however, I would strongly recommend looking further into these concepts if you are not already familiar with them.

For a more thorough explanation of RNNs and LSTMs see here, and for a deeper article on LSTMs in the context of language translation, in particular, see here.

In the case of the Encoder, each word in the input sentence is fed separately into the model in a number of consecutive time-steps.

At each time-step, t, the model updates a hidden vector, h, using information from the word inputted to the model at that time-step.

This hidden vector works to store information about the inputted sentence.

In this way, since no words have yet been inputted to the Encoder at time-step t=0, the hidden state in the Encoder starts out as an empty vector at this time-step.

We represent this hidden state with the blue box in Figure 4, where the subscript t=0 indicates the time-step and the superscript E corresponds to the fact that it’s a hidden state of the Encoder (rather than a D for the Decoder).

Figure 4: Encoder hidden vector at t=0

At each time-step, this hidden vector takes in information from the inputted word at that time-step, while preserving the information it has already stored from previous time-steps.

Thus, at the final time-step, the meaning of the whole input sentence is stored in the hidden vector.

This hidden vector at the final time-step is the thought vector referred to above, which is then inputted into the Decoder.

The process of encoding the English sentence “the cat likes to eat pizza” is represented in Figure 5.

Figure 5: Encoding of the sentence “the cat likes to eat pizza”

In the above figure, the blue arrows correspond to weight matrices, which we will work to enhance through training to achieve more accurate translations.

Also, notice how the final hidden state of the Encoder becomes the thought vector and is relabeled with superscript D at t=0.

This is because this final hidden vector of the Encoder becomes the initial hidden vector of the Decoder.

In this way, we are passing the encoded meaning of the sentence to the Decoder to be translated to a sentence in the output language.

However, unlike the Encoder, we need the Decoder to output a translated sentence of variable length.

Thus, we are going to have our Decoder output a prediction word at each time-step until we have outputted a complete sentence.

In order to start this translation, we are going to input a <SOS> tag as the input at the first time-step in the Decoder.

Just as in the Encoder, the Decoder will use the <SOS> input at time-step t=1 to update its hidden state.

However, rather than just proceeding to the next time-step, the Decoder will use an additional weight matrix to create a probability over all of the words in the output vocabulary.

In this way, the word with the highest probability in the output vocabulary will become the first word in the predicted output sentence.
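As a rough sketch of this step (the toy vocabulary and random scores below are purely illustrative), the prediction boils down to a softmax over the Decoder's scores followed by an argmax:

```python
import torch

# Toy output vocabulary and made-up Decoder scores for a single time-step
output_vocab = ["<SOS>", "<EOS>", "el", "gato", "le", "gusta", "comer", "pizza"]
logits = torch.randn(1, len(output_vocab))      # raw scores produced by the extra weight matrix

probs = torch.softmax(logits, dim=1)            # probability over the output vocabulary
predicted_index = torch.argmax(probs, dim=1)    # index with the highest probability
predicted_word = output_vocab[predicted_index.item()]
print(predicted_word)                           # becomes the next word of the predicted sentence
```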

This first step of the Decoder, translating from “the cat likes to eat pizza” to “el gato le gusta comer pizza” is shown in Figure 6.

For the sake of simplicity, the output vocabulary is restricted to the words in the output sentence (but in practice would consist of the thousands of words in the entire output vocabulary).

Figure 6: First step of the Decoder

Now, given that the word “el” was given the highest probability, this word becomes the first word in our outputted prediction sentence.

And we proceed by using “el” as the input in the next time-step as in Figure 7 below.

Figure 7: Second step of the Decoder

We proceed in this way through the duration of the sentence — that is, until we run into an error such as that depicted below in Figure 8.

Figure 8: Translation error in Decoder

As you can see, the Decoder has predicted “pizza” to be the next word in the translated sentence, when it should actually be “comer”.

When testing the model on the test set, we would do nothing to correct this error and would allow the Decoder to use this improper prediction as the input at the next time-step.

However, during the training process, we are going to keep “pizza” as the predicted word at that point in the sentence, but force our Decoder to input the correct word “comer” as the input for the next time-step.

This is a strategy referred to as teacher-forcing and helps speed up the training process.
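In code, this choice between the true word and the model's own prediction often looks something like the sketch below. Note that the teacher_forcing_ratio is a hypothetical knob: the description above always forces the correct word during training, while the PyTorch tutorial this code is based on applies teacher-forcing only with some probability.

```python
import random
import torch

teacher_forcing_ratio = 0.5              # hypothetical value; 1.0 would always force
target_word = torch.tensor([6])          # index of the correct word ("comer" in a toy vocabulary)
decoder_logits = torch.randn(1, 8)       # made-up Decoder scores over that 8-word vocabulary

if random.random() < teacher_forcing_ratio:
    next_input = target_word                   # teacher forcing: feed the true word
else:
    next_input = decoder_logits.argmax(dim=1)  # otherwise feed the model's own prediction
```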

Teacher-forcing is illustrated in the figure below.

Figure 9: Teacher-forcing

Now, since the Decoder has to output prediction sentences of variable lengths, the Decoder will continue predicting words in this fashion until it predicts the next word in the sentence to be an <EOS> tag.

Once this tag has been predicted, the decoding process is complete and we are left with a complete predicted translation of the input sentence.

The entire process of decoding the thought vector for the input sentence “the cat likes to eat pizza” is shown in Figure 10.

Figure 10: Decoding of the sentence “the cat likes to eat pizza”

We can then compare the accuracy of this predicted translation to the actual translation of the input sentence to compute a loss.

While there are several varieties of loss functions, a very common one to utilize is the Cross-Entropy Loss.

The equation of this loss function is detailed in Figure 11.

Figure 11: Cross-Entropy Loss function

In essence, what this loss function does is sum over the negative log likelihoods that the model gives to the correct word at each position in the output sentence.
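Written out, the loss pictured in Figure 11 takes roughly this form, where p(y_t | y_{<t}, x) denotes the probability the model assigns to the correct word y_t at position t, given the input sentence x and the previous output words, and T is the length of the output sentence:

```latex
\mathcal{L} = -\sum_{t=1}^{T} \log p\left(y_{t} \mid y_{<t}, x\right)
```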

Given that the negative log function has a value of 0 when the input is 1 and grows without bound as the input approaches 0 (as shown in Figure 12), the closer the probability that the model gives to the correct word at each point in the sentence is to 100%, the lower the loss.

Figure 12: Graph of the function y = -log(x)

For example, given that the correct first word in the output sentence above is “el”, and our model gave a fairly high probability to the word “el” at that position, the loss for this position would be fairly low.

Conversely, since the correct word at time-step t=5 is “comer”, but our model gave a rather low probability to the word “comer”, the loss at that step would be relatively high.

By summing over the loss for each word in the output sentence a total loss for the sentence is obtained.

This loss corresponds to the accuracy of the translation, with lower loss values corresponding to better translations.

When training, the loss values of several sentences in a batch would be summed together, resulting in a total batch loss.

This batch loss would then be used to perform mini-batch gradient descent to update all of the weight matrices in both the Decoder and the Encoder.

These updates modify the weight matrices to slightly enhance the accuracy of the model’s translations.

Thus, by performing this process iteratively, we eventually construct weight matrices that are capable of creating quality translations.

If you are unfamiliar with the concept of batches and/or mini-batch gradient descent you can find a short explanation of these concepts here.

As mentioned in the introduction, an attention mechanism is an incredible tool that greatly enhances an NMT model’s ability to create accurate translations.

While there are a number of different types of attention mechanisms, some of which you can read about here, the model built in this tutorial uses a rather simple implementation of global attention.

In this method of attention, at each time-step, the Decoder “looks back” at all of the hidden vectors of the Encoder to create a memory vector.

It then uses this memory vector, along with the hidden vector in the Decoder at that time-step, to predict the next word in the translated sentence.

In doing this, the Decoder utilizes valuable information from the Encoder that would otherwise go to waste.
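One simple dot-product flavor of global attention can be sketched in a few lines (the sizes below are toy values for illustration only):

```python
import torch
import torch.nn.functional as F

# Toy sizes: 6 Encoder time-steps, hidden vectors of size 4
encoder_hiddens = torch.randn(6, 4)    # one hidden vector per input word
decoder_hidden = torch.randn(1, 4)     # Decoder hidden vector at the current time-step

scores = encoder_hiddens @ decoder_hidden.t()            # (6, 1) alignment scores
weights = F.softmax(scores, dim=0)                       # attention weights over the Encoder steps
memory_vector = (weights * encoder_hiddens).sum(dim=0)   # weighted sum: the memory (context) vector

# memory_vector is then combined with decoder_hidden to predict the next word
```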

A visual representation of this process is shown in Figure 13.

I’d recommend reading the linked article in this paragraph to learn more about the various ways this memory vector can be calculated to gain a better understanding of this important concept.

Figure 13: Attention mechanism for time-step t=1 in Decoder

Note: Attention mechanisms are incredibly powerful and have recently been proposed (and shown) to be more effective when used on their own (i.e. without any RNN architecture).

If you’re interested in NMT I’d recommend you look into transformers and particularly read the paper “Attention Is All You Need”.

Coding Tutorial (Python)

Before beginning the tutorial I would like to reiterate that this tutorial is derived largely from the PyTorch tutorial “Translation with a Sequence to Sequence Network and Attention”.

However, this tutorial is optimized in a number of ways.

Most notably, this code allows for the data to be separated into batches (thus allowing us to utilize the enhanced parallel computing power of a GPU), can split datasets into a train and a test set, and also has added functionality to run on datasets of various formats.

Before we dive into the code tutorial, a little setup is in order.

If you’d like to run the model on a GPU (highly recommended), this tutorial uses Google Colab, which offers free access to Jupyter notebooks with GPU capability.

If you have other access to a GPU then feel free to use that as well.

Otherwise, you can look into a variety of other free online GPU options.

The code can be run on a CPU, but the capability of any model will be constrained by computational power (and make sure to change the batch size to 1 if you choose to do so).

To get started, navigate to Google Colaboratory and log into a Google account to get started.

From here, navigate to File > New Python 3 Notebook to launch a Jupyter notebook.

Once you’ve opened up a new notebook, we first need to enable GPU capabilities.

To do so, navigate to the top left of the page and select Edit > Notebook Settings.

From here, select GPU in the dropdown menu under “Hardware accelerator”.

Figure 14: Enabling GPU capabilities on Google Colab

We now have a Jupyter notebook with GPU capabilities and can start working towards creating an NMT model! First, we will import all of the necessary packages.

Now, run the following code to check if GPU capabilities are enabled.

If True is returned, a GPU is available.
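That cell boils down to something like the following:

```python
import torch

print(torch.cuda.is_available())   # True means a CUDA-capable GPU is visible to PyTorch

# A common pattern is to store the device once and reuse it throughout the notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```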

Now, before we begin doing any translation, we first need to create a number of functions which will prepare the data.

The following functions serve to clean the data and allow functionality for us to remove sentences that are too long or whose input sentences don’t start with certain words.
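As a rough idea of what these cleaning helpers can look like (the function names and the default max_length are illustrative, in the spirit of the PyTorch tutorial this code is based on):

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Strip accents, e.g. "café" -> "cafe"
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def normalize_string(s):
    # Lowercase, trim, separate punctuation, and drop non-letter characters
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s.strip()

def filter_pair(pair, max_length=10, prefixes=None):
    # Keep only pairs that are short enough and (optionally) whose input starts with given prefixes
    short_enough = all(len(s.split(" ")) < max_length for s in pair)
    good_prefix = prefixes is None or pair[0].startswith(prefixes)
    return short_enough and good_prefix
```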

Now, with functions that will clean the data, we need a way to transform this cleaned textual data into One Hot Encoding vectors.

First, we create a Lang class which will essentially allow us to construct a vocabulary for both the input and output languages.
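A sketch of such a Lang class, modeled on the one in the PyTorch tutorial (the reserved indices for the <SOS> and <EOS> tokens are illustrative):

```python
SOS_token, EOS_token = 0, 1  # indices reserved for the <SOS> and <EOS> tokens

class Lang:
    """Builds a Vocabulary (word <-> index mappings) for one language."""
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {SOS_token: "<SOS>", EOS_token: "<EOS>"}
        self.n_words = 2  # count the two special tokens

    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1
            self.word2count[word] = 1
        else:
            self.word2count[word] += 1
```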

Next, we create a prepareLangs function which will take a dataset of translated sentences and create Lang classes for the input and the output languages of a dataset.

This function has the ability to work with input and output sentences that are contained in two separate files or in a single file.

If the sentences are in two separate files, each sentence must be separated by a newline and each line in the files must correspond to each other (i.e. make a sentence pair).

For example, if your input file is english.txt and your output file is espanol.txt, the files should be formatted as in Figure 15.

Figure 15: Format for dataset stored in two separate files.

On the other hand, if the input and output sentences are stored in a single file, each sentence in the pair must be separated by a tab and each sentence pair must be separated by a newline.

For example, if your single file name is data.txt, the file should be formatted as in Figure 16.

Figure 16: Format for dataset stored in one single file.

Note: In order for this function to work with both one and two files, the file_path argument must be a tuple: two elements if the data is stored in two files, and one element if the data is stored in a single file.

With a function that works to prepare the language vocabularies for both the input and output languages, we can use all of the above functions to create a single function that will take a dataset of both input and target sentences and complete all of the preprocessing steps.

Thus, the prepareData function creates a Lang class for each language and fully cleans and trims the data according to the passed arguments.

In the end, this function will return both language classes along with a set of training pairs and a set of test pairs.

While we have created a vocabulary for each language, we still need to create functions which use these vocabularies to transform sentence pairs both to and from their One Hot Encoding vector representations.
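As a sketch of what these functions might look like (building on the hypothetical Lang class above), note that in practice we store word indices rather than explicit One Hot vectors and let an nn.Embedding layer inside the model do the equivalent of multiplying a one-hot vector by a weight matrix:

```python
import torch

def indexes_from_sentence(lang, sentence):
    # Look up each word's index in the Vocabulary and append the <EOS> index
    return [lang.word2index[word] for word in sentence.split(" ")] + [EOS_token]

def tensor_from_sentence(lang, sentence, device="cpu"):
    indexes = indexes_from_sentence(lang, sentence)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)
```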

NMT is no different than normal machine learning in that minibatch gradient descent is the most effective way to train a model.

Thus, before we begin building our model, we want to create a function to batchify our sentence pairs so that we can perform gradient descent on mini-batches.

We also create the function pad_batch to handle the issue of variable length sentences in a batch.

This function essentially appends <EOS> tags to the end of each of the shorter sentences until every sentence in the batch is the same length.
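A minimal sketch of such a padding helper (assuming each sentence is already a list of word indices and that the <EOS> index is 1, as in the Lang sketch above):

```python
def pad_batch(batch, eos_index=1):
    # Pad shorter sentences (lists of word indices) with the <EOS> index so every sentence
    # in the batch has the same length and the batch can be stacked into a single tensor
    max_len = max(len(sentence) for sentence in batch)
    return [sentence + [eos_index] * (max_len - len(sentence)) for sentence in batch]
```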

And with that, we have created all of the necessary functions to preprocess the data and are finally ready to build our Encoder Decoder model! With a general understanding of the Encoder Decoder architecture and attention mechanisms, let’s dive into the Python code that creates these frameworks.

Rather than explain each aspect of the Encoder and the Decoder, I will simply provide the code and refer you to the PyTorch documentation for any questions you may have on various aspects of the code.
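The complete Encoder and attention Decoder are in the linked notebook; as a rough idea of what the Encoder half looks like (the class name and exact interface here are illustrative, not the notebook's exact code):

```python
import torch.nn as nn

class EncoderLSTM(nn.Module):
    """Minimal Encoder sketch: embed each input word and run the sequence through an LSTM."""
    def __init__(self, input_vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(input_vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)

    def forward(self, input_indices, hidden=None):
        # input_indices: (seq_len, batch_size) tensor of word indices
        embedded = self.embedding(input_indices)        # (seq_len, batch, hidden_size)
        outputs, hidden = self.lstm(embedded, hidden)   # outputs keep every time-step's hidden
        return outputs, hidden                          # vector, which the attention Decoder uses
```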

Now, in order to train and test the model, we will use the following functions.

The train_batch function below performs a training loop on a single training batch.

This includes completing a forward pass through the model to create a predicted translation for each sentence in the batch, computing the total loss for the batch, and then back-propagating on the loss to update all of the weight matrices in both the Encoder and the Decoder.
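As a hedged sketch of the logic inside such a function (the Decoder call signature decoder(input, hidden, encoder_outputs) and the helper names are assumptions based on the modules sketched above, not the exact notebook code):

```python
import torch

def train_batch(input_tensor, target_tensor, encoder, decoder,
                encoder_optimizer, decoder_optimizer, criterion, sos_index=0):
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    encoder_outputs, encoder_hidden = encoder(input_tensor)   # forward pass through the Encoder

    decoder_input = torch.full((1, target_tensor.size(1)), sos_index, dtype=torch.long)
    decoder_hidden = encoder_hidden                            # thought vector initializes the Decoder
    loss = 0
    for t in range(target_tensor.size(0)):                     # one Decoder step per target word
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)
        loss += criterion(decoder_output, target_tensor[t])
        decoder_input = target_tensor[t].unsqueeze(0)          # teacher forcing

    loss.backward()                                            # back-propagate the batch loss
    encoder_optimizer.step()                                   # update Encoder weights
    decoder_optimizer.step()                                   # update Decoder weights
    return loss.item() / target_tensor.size(0)
```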

The train function simply performs the train_batch function iteratively for each batch in a list of batches.

In this way, we can pass a list of all of the training batches to complete a full epoch through the training data.

The following test_batch and test functions are essentially the same as the train_batch and train functions, with the exception that these test functions are to be performed on the test data and do not include a back-propagation step.

Thus, these functions do not update the weight matrices in the model and are solely used to evaluate the loss (i.e. the accuracy) on the test data.

In turn, this will help us track how the model performs on data outside of the training set.

During training, it will also be nice to be able to track our progress in a more qualitative sense.

The evaluate function will allow us to do so by returning the predicted translation that our model makes for a given input sentence.

And the evaluate_randomly function will simply predict translations for a specified number of sentences chosen randomly from the test set (if we have one) or the train set.

A few helper functions below will work to plot our training progress, print memory consumption, and reformat time measurements.

And finally, we can put all of these functions into a master function which we will call train_and_test.

This function will take quite a few arguments, but will completely train our model while evaluating our progress on the train set (and test set if present) at specified intervals.

Also, some arguments will specify whether we want to save the output in a separate .txt file, create a graph of the loss values over time, and save the weights of both the Encoder and the Decoder for future use.

The next few cells after this function will outline how you can modify each argument, but just know that this function will essentially be all we need to run in order to train the model.

Now that we have everything in place, we are ready to import our dataset, initialize all of the hyperparameters, and start training!

First, in order to upload a dataset, run the upload cell and you will see the following:

Figure 17: Upload data to Google Colab

Simply click on the “Choose Files” button and navigate to the dataset you wish to upload.
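The upload cell itself boils down to something like the following (the files helper comes from the Colab-only google.colab package):

```python
from google.colab import files

uploaded = files.upload()   # opens the "Choose Files" dialog shown in Figure 17
```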

In this tutorial, we are using the same dataset that was used in the original PyTorch tutorial.

You can download that dataset of English to French translations here.

You can also experiment with a number of other datasets of various languages here.

If you are looking to get more state-of-the-art results I’d recommend trying to train on a larger dataset.

You can find some larger datasets here, but also feel free to use any corpus of translated excerpts as long as they are formatted like in Figure 15 or Figure 16 above.

Note: You may have issues uploading larger datasets to Google Colab using the upload method presented in this tutorial.

If you run into such issues, read this article to learn how to upload large files.

Now, run the following cell to ensure that your dataset has been successfully uploaded.

Figure 18: Run ls to ensure dataset has been uploaded

From here, edit the following cells to apply to your dataset and preferences.

The following cell consists of the variety of hyperparameters that you are going to need to play with towards finding an effective NMT model.

So have fun experimenting with these.

And finally, you just need to run the following cell to train your model according to all of the hyperparameters you set above.

And voilà! You have just trained an NMT model! Congrats! If you saved any graphs, output files, or output weights, you can view all of the saved files by running ls again.

And to download any of these files simply run the code below.
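Something along these lines works (the filename is just a placeholder for whatever your run saved):

```python
from google.colab import files

files.download("loss_graph.png")   # replace with the name of any file your run produced
```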

Now, if you’d like to test the model on sentences outside both the train and the test set you can do that as well.

Just make sure the sentence you are trying to translate is in the same language as the input language of your model.

I trained my model and the PyTorch tutorial model on the same dataset used in the PyTorch tutorial (which is the same dataset of English to French translations mentioned above).

To preprocess the data, the trim was set to 10 and the eng_prefixes filter that PyTorch used was set to TRUE.

With these restrictions, the dataset was cut to a rather small set of 10,853 sentence pairs.

The PyTorch tutorial broke one of the fundamental rules of machine learning and didn’t use a test set (not good practice!).

So, just for comparison purposes, I kept all of these sentence pairs in my train set and didn’t use a test set (i.e. perc_train_set = 1.0).

However, I’d recommend that you always use a test set when training any sort of machine learning model.

A comparison of the hyperparameters I chose for my model vs. the hyperparameters in the PyTorch tutorial model is shown in Table 2.

The graph below in Figure 19 depicts the results of training for 40 minutes on an NVIDIA GeForce GTX 1080 (an older GPU; you can actually achieve superior results using Google Colab).

Table 2: Hyperparameters comparison

Figure 19: Loss over a 40-minute training period for this tutorial model (My Model) vs. the PyTorch tutorial model

Since this dataset had no test set, I evaluated the model on a few sentences from the train set.

Figure 20: Predicted translation of the PyTorch tutorial model (Blue) vs. My Model (Orange)

From these results, we can see that the model in this tutorial produces more effective translations in the same amount of training time.

However, when we try to use this model to translate sentences outside of the train set, it immediately breaks down.

We can see this in the model’s attempted translation of the following sentence which was NOT in the dataset.

Figure 21: Failed translation on sentence outside the dataset.

This failure of the model is largely due to the fact that it was trained on such a small dataset.

Furthermore, we were not aware of this problem because we had no test set to check the model’s ability to translate on sentences outside of the train set.

To combat this issue, I retrained my model on the same dataset, this time with a trim=40 and without the eng_prefixes filter.

Even when I set aside 10% of the sentence pairs for a test set, the train set was still over 10x the size of the one used to train the model before (122,251 train pairs).

I also modified the hidden size of the model from 440 to 1080 and decreased the batch size from 32 to 10.

Finally, I changed the initial learning rate to 0.5 and added a learning rate schedule which decreased the learning rate by a factor of five after every five epochs.
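Such a schedule can be implemented with PyTorch's built-in StepLR, sketched below (the optimizer setup and the variable names encoder, decoder, and n_epochs are placeholders, not the notebook's exact code):

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Hypothetical setup: a single SGD optimizer over both networks' parameters
optimizer = optim.SGD(list(encoder.parameters()) + list(decoder.parameters()), lr=0.5)

# gamma=0.2 divides the learning rate by 5; step_size=5 applies the decay every 5 epochs
scheduler = StepLR(optimizer, step_size=5, gamma=0.2)

for epoch in range(n_epochs):
    ...                      # train on all batches for one epoch
    scheduler.step()         # decay the learning rate on schedule
```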

With this larger dataset and updated hyperparameters, the model was trained on the same GPU.

The loss on the train and test set during training, as well as the translation of the same sentence it failed on above, are shown below.

Figure 22: Train and test loss vs. time

Figure 23: Improved (yet still imperfect) translation of a sentence outside of the dataset.

As you can see, the translation of this sentence is significantly improved.

However, in order to achieve a perfect translation, we would probably need to increase the size of the dataset by even more.

Conclusion

While this tutorial provides an introduction to NMT using the Encoder Decoder structure, the implemented attention mechanism is rather basic.

If you are interested in creating a more state-of-the-art model I’d recommend looking into the concept of local attention and attempting to implement this more advanced type of attention within the Decoder portion of the model.

Otherwise, I hope you enjoyed the tutorial and learned a lot! The basis of the material covered in this post was from my thesis at Loyola Marymount University.

If you want to take a look at the PPT presentation I used to share these ideas (which includes the majority of the images in this article) you can find that here.

You can also read the Thesis paper I wrote on the topic, which explains the math behind NMT in much greater depth, here.

And lastly, the full Jupyter notebook for this project can be found here or alternatively a Python script version can be found here.

