Project: Population-Based Training for Machine Translation with different metaheuristic algorithms

Yevheniia Kryvenko, May 28

Authors: O. Matsuk, Y. Kryvenko, D. Grygorian

Building an optimal neural network (NN) model for a specific task is a complex problem with many pitfalls.

Common research domains, such as image classification, caption generation, or machine translation, nowadays offer a wide selection of architectures to choose from, but their performance still depends heavily on various factors, most notably the choice of hyperparameters.

Hyperparameter tuning is one of the most time- and energy-consuming tasks in NN training.

It usually involves many iterations of re-fitting the network and manually updating the parameters towards more promising configurations.

Population-Based Training (PBT) [1] aims to solve this issue by providing a framework which trains models in parallel but shares knowledge within the population to better guide the hyperparameter search.

In this project we evaluate PBT and analyse how well it handles the exploration-exploitation trade-off when a different metaheuristic is used.

Contents:

Introduction
Data description
Machine translation
Population-based training
Meta-heuristic exploration

Introduction

PBT is a general-purpose technique which can be applied to practically any NN training task.

In this study we will build an English-to-German neural machine translation (NMT) model as our baseline.

We will provide a slightly simplified implementation of PBT and apply it to train our model.

Lastly, we will use particle swarm optimization as a substitute for the PBT metaheuristic and compare the results to vanilla PBT.

Data description

Training an NMT model requires a large number of translation pairs in the desired languages.

For our task we will use the WMT 2014 English-to-German dataset [4].

It consists of 1920209 sentences in both languages.

In order to make the experimentation more feasible we use 30% of the dataset as the training data.

We tokenize, trim and pad all sentences to length 30.

Using the training set we build the vocabularies and replace words that occur fewer than 5 times with the <UNK> token to reduce the vocabulary sizes.

As a result, we get [576062, 30] matrices of encoded sequences with 24976 words in English and 50141 in German vocabularies.
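The preprocessing described above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the whitespace tokenizer, the `<PAD>` token, and the helper names `build_vocab` and `encode` are assumptions made for the example.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"   # <PAD> is an assumed padding token
MAX_LEN = 30                  # sentence length used in the article
MIN_COUNT = 5                 # words seen fewer times become <UNK>


def build_vocab(sentences, min_count=MIN_COUNT):
    """Map every sufficiently frequent word to an integer id."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {PAD: 0, UNK: 1}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len=MAX_LEN):
    """Tokenize on whitespace, trim to max_len, and right-pad with the <PAD> id."""
    ids = [vocab.get(word, vocab[UNK]) for word in sentence.split()[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Toy corpus: "the cat sat" is frequent enough to keep, "a rare word" is not.
corpus = ["the cat sat"] * 5 + ["a rare word"]
vocab = build_vocab(corpus)
encoded = [encode(s, vocab) for s in corpus]
```

Each row of `encoded` has exactly 30 ids, so stacking the rows gives a matrix of shape [number of sentences, 30], analogous to the matrices mentioned above.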

Machine Translation

The most successful recent architectures for NMT are based on the Encoder-Decoder (a.k.a. Sequence-to-Sequence) framework [2].

It consists of two recurrent neural networks (RNN): an encoder and a decoder.

The encoder accepts a source sequence and produces a fixed-length representation, called the context vector, which is usually the last hidden state of the encoder RNN.

The decoder uses the context vector together with a target sequence to predict an output word by word by modeling a probability distribution over a fixed target vocabulary.

During training, both RNNs are optimized jointly, with the expected target sequence used as decoder input (teacher forcing).

During inference, decoder outputs at each timestep become decoder inputs for the next word prediction.
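The inference loop can be sketched as below. This is only an illustration under stated assumptions: `step_fn` is a hypothetical stand-in for one decoder RNN step, and the next word is chosen greedily (argmax), which the article does not specify.

```python
import numpy as np


def greedy_decode(step_fn, state, start_id, end_id, max_len=30):
    """Feed each predicted token back into the decoder until <end> or max_len.

    step_fn(prev_token, state) -> (probabilities over the vocabulary, new state)
    is a hypothetical stand-in for one decoder RNN step.
    """
    tokens, prev = [], start_id
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        prev = int(np.argmax(probs))   # pick the most likely next word
        if prev == end_id:
            break
        tokens.append(prev)
    return tokens


def toy_step(prev_token, state):
    """Toy decoder: always predicts (prev_token + 1) over a 5-word vocabulary."""
    probs = np.zeros(5)
    probs[(prev_token + 1) % 5] = 1.0
    return probs, state


decoded = greedy_decode(toy_step, state=None, start_id=0, end_id=4)
```

In a real model, `state` would start as the context vector produced by the encoder, and `step_fn` would run the decoder cell followed by the softmax over the target vocabulary.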

Figure 1. Encoder-Decoder architecture. Using the context vector S produced by the encoder from the source sequence in English, the decoder is trained to predict the target sentence in German.

The architecture can be further improved by using an attention layer [3], which is meant to selectively focus on parts of the source sentence during prediction.

The attention layer allows the decoder to use information from past encoder states in addition to the limited source sequence representation in the form of a context vector.
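To make the mechanism concrete, here is a small NumPy sketch of additive attention in the spirit of [3]. It is not the simplified layer used in this project; the matrix names `W_dec`, `W_enc`, `v` and the toy dimensions are assumptions for illustration.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Return (context vector, attention weights) for one decoder step.

    decoder_state:  (d_dec,)    current decoder hidden state
    encoder_states: (T, d_enc)  one hidden state per source position
    """
    # score_t = v . tanh(W_dec s + W_enc h_t), one score per source position
    scores = np.tanh(decoder_state @ W_dec + encoder_states @ W_enc) @ v
    weights = softmax(scores)             # shape (T,), sums to 1
    context = weights @ encoder_states    # weighted sum of encoder states
    return context, weights


rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 6, 4, 3, 5       # toy sizes, chosen arbitrarily
context, weights = additive_attention(
    rng.normal(size=d_dec), rng.normal(size=(T, d_enc)),
    rng.normal(size=(d_dec, d_att)), rng.normal(size=(d_enc, d_att)),
    rng.normal(size=d_att),
)
```

The weights form a probability distribution over source positions, so the context vector changes at every decoder step instead of being fixed to the last encoder state.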

Figure 2. Encoder-Decoder with Attention layer. Each decoder output uses attention weights.

A relatively simple variation of such architecture, inspired by attention_keras, is implemented and used as the baseline model.

Notably, we simplify the attention layer and add a Dropout layer before the final prediction to use its dropout rate as one of the hyperparameters for PBT.

Figure 3. Neural machine translation baseline model architecture

We train this model for 1 epoch on the training data with a batch size of 64, which results in about 9000 steps.

We use the Adam optimizer with a learning rate of 0.001 and fix the dropout rate at 0.2.

To evaluate the model performance we use the BLEU score metric, which is a common choice for machine translation.

It is based on the number of n-gram matches between the target and predicted sentences, usually for n up to 4.
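As a rough illustration of the metric, the sketch below computes an unsmoothed, single-reference, sentence-level BLEU. It is an assumption-laden simplification: the scores reported in this article were not necessarily computed this way, and in practice BLEU is usually aggregated over a whole corpus with an established library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Unsmoothed single-reference BLEU for one tokenized sentence pair."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # The brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
perfect = sentence_bleu(reference, reference)
partial = sentence_bleu(reference, "the cat sat on a mat".split())
```

For the second pair the n-gram precisions are 5/6, 3/5, 2/4 and 1/3, so a single wrong word already lowers the score considerably. This sensitivity is one reason why BLEU values for a briefly trained model tend to be small.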

Figure 4. Baseline loss and BLEU-score measured on a validation set of size 6400 every 100 steps.

Baseline test BLEU score: 0.0801

Just for fun we can check out some of the translations produced by our baseline model.

Population-Based Training

As we already mentioned, hyperparameter tuning is a challenging and resource-heavy task.

In the context of NN training the term hyperparameters may refer to any configurable value other than the regular parameters (weights) of the network.

Examples include the number of epochs, the learning rate, the numbers of layers and units, weight initializers, and regularizer coefficients.

PBT handles hyperparameter tuning under the assumption of a fixed architecture, so we can only consider hyperparameters which do not affect the network dimensionality.

Hyperparameter tuning can be done either sequentially, meaning all models are trained one after another, or in parallel, so that all models are trained independently at the same time.

In the first case, the tuning becomes extremely slow, and in the second one, a lot of resources are wasted on the models with bad performance since they are compared only once the training is done.

PBT [1] combines both approaches and addresses their main drawbacks.

In PBT, a population of models with different hyperparameters is trained in parallel with a periodic exchange of knowledge between the members.

On such occasions models are evaluated against each other, and the ones with better performance propagate information to the weaker ones.

Figure 5. Population-based training

Once we have initialized a population of models with different hyperparameters, we repeatedly perform the following operations:

1. Train each population member for a certain number of steps.
2. Evaluate each member based on loss or a different (possibly non-differentiable) metric correlated to loss.
3. Exploit the best models by discarding the worst models and overwriting their parameters (weights) and hyperparameters.
4. Explore new hyperparameters by randomly perturbing the hyperparameters of the best models.

For simplicity, we provide a synchronous PBT implementation, similar to this one.

It evaluates all members of the population sequentially before finishing the shared step.
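A synchronous train/evaluate/exploit/explore loop of this kind can be sketched on a toy problem as follows. This is not the project's implementation: the truncation rule (bottom 25% copy from the top 25%), the 0.8/1.2 perturbation factors, and the quadratic toy objective with a single learning-rate hyperparameter are assumptions chosen to keep the example runnable.

```python
import copy
import random


def pbt(population, train_fn, eval_fn, perturb_fn, rounds, frac=0.25):
    """Synchronous PBT: train, evaluate, exploit, explore; then repeat."""
    for _ in range(rounds):
        for m in population:                       # 1. train
            m["weights"] = train_fn(m["weights"], m["hparams"])
        for m in population:                       # 2. evaluate (higher is better)
            m["score"] = eval_fn(m["weights"])
        ranked = sorted(population, key=lambda m: m["score"], reverse=True)
        k = max(1, int(len(ranked) * frac))
        for loser in ranked[-k:]:
            winner = random.choice(ranked[:k])
            # 3. exploit: copy the weights of a top member
            loser["weights"] = copy.deepcopy(winner["weights"])
            # 4. explore: perturb a copy of the winner's hyperparameters
            loser["hparams"] = perturb_fn(copy.deepcopy(winner["hparams"]))
    return max(population, key=lambda m: m["score"])


# Toy task (assumed): minimize (w - 3)^2 with plain gradient descent.
def train_fn(w, hparams, steps=5):
    for _ in range(steps):
        w -= hparams["lr"] * 2 * (w - 3.0)
    return w


def eval_fn(w):
    return -(w - 3.0) ** 2


def perturb_fn(hparams):
    factor = random.choice([0.8, 1.2])             # assumed perturbation factors
    hparams["lr"] = min(1.0, max(1e-4, hparams["lr"] * factor))
    return hparams


random.seed(0)
population = [
    {"weights": 0.0, "hparams": {"lr": 10 ** random.uniform(-4, 0)}, "score": None}
    for _ in range(8)
]
best = pbt(population, train_fn, eval_fn, perturb_fn, rounds=10)
```

Because the weights are copied together with the hyperparameters, a weak member does not restart from scratch: it continues training from the winner's state with slightly different settings.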

For our NMT model we consider 2 hyperparameters with their corresponding initial search spaces:

learning rate: [1e-4; 1]
dropout: [0.1; 0.5]

We train a population of 8 models with PBT in the same experiment settings:

Figure 6. Vanilla population-based training validation loss and BLEU-score

Vanilla PBT test BLEU score: 0.1212

As expected, PBT is able to explore multiple hyperparameter configurations and reach better performance than a single model within the same amount of steps.

We can also observe how the hyperparameter space is explored during the training:

Figure 7. Hyperparameter modifications of the vanilla population-based-training members.

Meta-heuristic exploration

Now that we have established some points of comparison, we can try to modify the PBT algorithm by changing its meta-heuristic, which defines the conditions and exact logic of the explore/exploit operations.

For this purpose we will use the particle swarm optimization (PSO) algorithm.

Figure 8. Particle Swarm Optimization

In particle swarm optimization all members have a velocity vector which modifies their position over time.

The velocity is also periodically updated based on the best result found by the member, p, and the best overall result within the population, g.

It is quite straightforward to modify our PBT implementation to use velocity as the means to update member hyperparameters.
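A velocity-based hyperparameter update of this kind might look like the sketch below. It uses the textbook PSO rule, v ← ωv + c_p r_p (p − x) + c_g r_g (g − x) followed by x ← x + v, where r_p and r_g are uniform random numbers in [0, 1). The inertia ω and the coefficients c_p and c_g are illustrative assumptions, not the values used in this project, and the clipping to the search space is also an assumption.

```python
import numpy as np


def pso_step(x, v, p_best, g_best, rng, inertia=0.7, c_p=1.5, c_g=1.5):
    """One PSO update for a single population member.

    x, v:    current hyperparameter vector and its velocity
    p_best:  best position found so far by this member (p in the text)
    g_best:  best position found so far by the whole population (g)
    """
    r_p, r_g = rng.random(x.shape), rng.random(x.shape)
    v_new = inertia * v + c_p * r_p * (p_best - x) + c_g * r_g * (g_best - x)
    return x + v_new, v_new


rng = np.random.default_rng(0)
low, high = np.array([1e-4, 0.1]), np.array([1.0, 0.5])   # search-space bounds
x = np.array([0.5, 0.4])                                  # [learning rate, dropout]
v = np.zeros(2)
x_new, v_new = pso_step(x, v, p_best=x.copy(), g_best=np.array([0.01, 0.2]), rng=rng)
x_new = np.clip(x_new, low, high)                         # stay inside the search space
```

In this toy call the member starts at rest and at its own best position, so the only pull comes from the population best, and both hyperparameters move toward it.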

Figure 9. Population-based training with particle swarm optimization validation loss and BLEU-score

PBT with PSO test BLEU score: 0.0935

It is worth pointing out that this version of PBT does not incorporate weight overwriting, which is a possible explanation for why we end up with such low average metrics over the population, and are consequently unable to outperform the vanilla PBT meta-heuristic.

Conclusion

The results suggest that the hyperparameter tuning in NN training cannot be treated as a simple optimization problem, and that the optimal hyperparameters form a schedule over time and are conditioned on the state of the network.

This experiment highlights the crucial points within the PBT algorithm and establishes an area for future exploration of meta-heuristics applicable for its improvement.

The full code in Python can be found here.

References

[1] Jaderberg et al. (2017) Population Based Training of Neural Networks. https://arxiv.org/pdf/1711.09846.pdf
[2] Sutskever et al. (2014) Sequence to Sequence Learning with Neural Networks. https://arxiv.org/abs/1409.3215
[3] Bahdanau et al. (2015) Neural machine translation by jointly learning to align and translate. https://arxiv.org/pdf/1409.0473.pdf
[4] ACL 2014 Ninth Workshop on Statistical Machine Translation. https://www.statmt.org/wmt14/translation-task.html