Attn: Illustrated Attention

Here’s the entire animation:

Fig. 6: Attention

Intuition: How does attention actually work?

Answer: Backpropagation, surprise surprise.

Backpropagation will do whatever it takes to ensure that the outputs will be close to the ground truth.

This is done by altering the weights in the RNNs and in the score function, if any.

These weights will affect the encoder hidden states and decoder hidden states, which in turn affect the attention scores.
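To make this concrete, here is a toy NumPy sketch of how a weight inside the score function steers attention. It assumes a bilinear score (decoder state, times a weight matrix, times encoder state). The names `attention_weights`, `W_before` and `W_after` are made up for illustration, and `W_after` is hand-picked to stand in for whatever backpropagation might have learnt; no actual training happens here.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_weights(enc_states, dec_state, W):
    # One score per encoder hidden state: dec_state^T W enc_state_i.
    scores = enc_states @ (W.T @ dec_state)
    return softmax(scores)

# Two encoder hidden states (rows) and one decoder hidden state.
enc_states = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
dec_state = np.array([1.0, 0.0])

# Hypothetical score-function weights before and after a training update.
W_before = np.eye(2)
W_after = np.array([[0.0, 1.0],
                    [1.0, 0.0]])

w_before = attention_weights(enc_states, dec_state, W_before)
w_after = attention_weights(enc_states, dec_state, W_after)
print(w_before)  # most of the attention is on the first encoder state
print(w_after)   # most of the attention is on the second encoder state
```

The hidden states are identical in both calls; only the score-function weight differs, and that alone is enough to move the attention from one encoder state to the other.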

2. Attention: Examples

We have seen both the seq2seq and the seq2seq+attention architectures in the previous section.

In the next sub-sections, let’s examine 3 more seq2seq-based architectures for NMT that implement attention.

For completeness, I have also appended their Bilingual Evaluation Understudy (BLEU) scores, a standard metric for evaluating a generated sentence against a reference sentence.
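If you have not met BLEU before, the rough idea is: count how many n-grams of the generated sentence also appear in the reference, combine the precisions for n = 1 to 4, and penalise translations that are too short. Below is a simplified, sentence-level sketch with a single reference and no smoothing. The function names `ngram_counts` and `simple_bleu` are my own, and this is not the exact evaluation script behind the scores quoted in this article, which are computed over a whole test corpus and reported on a 0 to 100 scale.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # "Modified" precision: clip each n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
perfect = simple_bleu(reference, reference)
shorter = simple_bleu("the cat sat on".split(), reference)
unrelated = simple_bleu("a dog ran off".split(), reference)
print(perfect, shorter, unrelated)
```

The shortened candidate has perfect n-gram precision, yet it still scores below the exact match because of the brevity penalty.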


2a. Bahdanau et al. (2015) [1]

This implementation of attention is one of the founding fathers of attention.

The authors use the word ‘align’ in the title of the paper “Neural Machine Translation by Jointly Learning to Align and Translate” to mean adjusting the weights that are directly responsible for the score, while training the model.

Things to note about the architecture:

The encoder is a bidirectional (forward+backward) gated recurrent unit (BiGRU).

The decoder is a GRU whose initial hidden state is a vector derived from the last hidden state of the backward encoder GRU (not shown in the diagram below).

The score function in the attention layer is additive/concat.

The input to the next decoder step is the concatenation of the output from the previous decoder time step (pink) and the context vector from the current time step (dark green).
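The last two points can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors’ code: the BiGRU outputs are faked with random vectors, all sizes are arbitrary, and the names (`additive_attention`, `annotations`, `W_a`, `U_a`, `v_a`, `prev_output`) are placeholders I picked for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(dec_state, annotations, W_a, U_a, v_a):
    # Additive/concat score: v_a^T tanh(W_a s + U_a h_j), for all j at once.
    scores = np.tanh(W_a @ dec_state + annotations @ U_a.T) @ v_a
    weights = softmax(scores)
    context = weights @ annotations  # weighted sum of encoder annotations
    return weights, context

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att, d_emb = 5, 4, 3, 6, 2  # arbitrary toy sizes

# Stand-ins for the forward and backward GRU hidden states of a BiGRU encoder.
forward_states = rng.normal(size=(T, d_enc))
backward_states = rng.normal(size=(T, d_enc))
annotations = np.concatenate([forward_states, backward_states], axis=1)

dec_state = rng.normal(size=d_dec)
W_a = rng.normal(size=(d_att, d_dec))
U_a = rng.normal(size=(d_att, 2 * d_enc))
v_a = rng.normal(size=d_att)

weights, context = additive_attention(dec_state, annotations, W_a, U_a, v_a)

# Input to the next decoder step: previous decoder output + context vector.
prev_output = rng.normal(size=d_emb)  # stand-in for the previous output
decoder_input = np.concatenate([prev_output, context])
print(weights.shape, context.shape, decoder_input.shape)
```

In a real model `W_a`, `U_a` and `v_a` are learnt jointly with the rest of the network, which is exactly the ‘align’ part of the paper’s title.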


Fig. 2a: NMT from Bahdanau et al. Encoder is a BiGRU, decoder is a GRU.

The authors achieved a BLEU score of 26.75 on the WMT’14 English-to-French dataset.

Intuition: seq2seq with bidirectional encoder + attention

Translator A reads the German text while writing down the keywords.

Translator B (who takes on a senior role because he has the extra ability to translate a sentence by reading it backwards) reads the same German text from the last word to the first, while jotting down the keywords.

These two regularly discuss every word they have read thus far.

Once done reading this German text, Translator B is tasked with translating the German sentence to English word by word, based on the discussion and the consolidated keywords that both of them have picked up.

Translator A is the forward RNN, Translator B is the backward RNN.


2b. Luong et al. (2015) [2]

The authors of Effective Approaches to Attention-based Neural Machine Translation have made it a point to simplify and generalise the architecture from Bahdanau et al.


Here’s how:

The encoder is a two-stacked long short-term memory (LSTM) network.

The decoder has the same architecture, and its initial hidden states are the last encoder hidden states.

The score functions they experimented with were (i) additive/concat, (ii) dot product, (iii) location-based, and (iv) ‘general’.

The concatenation of the output from the current decoder time step and the context vector from the current time step is fed into a feed-forward neural network to give the final output (pink) of the current decoder time step.
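Here is a small sketch of two of these score functions (dot product and ‘general’) and of the final feed-forward step. It is a toy under my own assumptions: the hidden states are hand-written 2-dimensional vectors rather than LSTM outputs, the feed-forward network is a single tanh layer, and `score_dot`, `score_general`, `attentional_output`, `W_a` and `W_c` are illustrative names.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_dot(h_t, enc_states):
    # Dot product between the decoder state and every encoder state.
    return enc_states @ h_t

def score_general(h_t, enc_states, W_a):
    # 'General' score: h_t^T W_a h_s, with a learnable matrix W_a.
    return enc_states @ (W_a.T @ h_t)

def attentional_output(h_t, enc_states, scores, W_c):
    weights = softmax(scores)
    context = weights @ enc_states
    # Concatenate context and decoder state, then one tanh layer.
    return np.tanh(W_c @ np.concatenate([context, h_t]))

# Three encoder hidden states (rows) and the current decoder hidden state.
enc_states = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
h_t = np.array([1.0, 0.0])

dot_scores = score_dot(h_t, enc_states)
# With W_a set to the identity, 'general' reduces to the dot product.
general_scores = score_general(h_t, enc_states, np.eye(2))

W_c = np.full((2, 4), 0.5)  # hypothetical feed-forward weights
h_tilde = attentional_output(h_t, enc_states, dot_scores, W_c)
print(dot_scores, general_scores, h_tilde)
```

The dot product needs no extra parameters, while ‘general’ adds one matrix that lets the model learn which directions of the hidden states should count as similar.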


Fig. 2b: NMT from Luong et al. Encoder is a 2 layer LSTM, likewise for decoder.

On the WMT’15 English-to-German, the model achieved a BLEU score of 25.


Intuition: seq2seq with 2-layer stacked encoder + attention

Translator A reads the German text while writing down the keywords.

Likewise, Translator B (who is more senior than Translator A) also reads the same German text, while jotting down the keywords.

Note that the junior Translator A has to report to Translator B at every word they read.

Once done reading, both of them translate the sentence to English together word by word, based on the consolidated keywords that they have picked up.

2c. Google’s Neural Machine Translation (GNMT) [9]

Because most of us must have used Google Translate in one way or another, I feel that it is imperative to talk about Google’s NMT, which was implemented in 2016.

GNMT is a combination of the previous 2 examples we have seen (heavily inspired by the first [1]).

The encoder consists of a stack of 8 LSTMs, where the first is bidirectional (whose outputs are concatenated), and a residual connection exists between outputs from consecutive layers (starting from the 3rd layer).

The decoder is a separate stack of 8 unidirectional LSTMs.

The score function used is additive/concat, like in [1].

Again, like in [1], the input to the next decoder step is the concatenation of the output from the previous decoder time step (pink) and the context vector from the current time step (dark green).
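The residual connections in the 8-layer stack are the part that is easiest to misread, so here is a rough sketch of one time step flowing up the stack. To keep it short I swap each LSTM layer for a plain tanh layer and leave out the bidirectional first layer, so this only shows where the skip connections go; `toy_layer`, `stacked_encoder_step` and `residual_from` are names I made up.

```python
import numpy as np

def toy_layer(x, W):
    # Stand-in for one LSTM layer at a single time step (NOT a real LSTM).
    return np.tanh(W @ x)

def stacked_encoder_step(x, layer_weights, residual_from=3):
    h = x
    for layer_idx, W in enumerate(layer_weights, start=1):
        out = toy_layer(h, W)
        if layer_idx >= residual_from:
            # Skip connection: add the layer's input to its output.
            out = out + h
        h = out
    return h

rng = np.random.default_rng(0)
d = 4  # every layer keeps the same width so the addition is well defined
layer_weights = [rng.normal(size=(d, d)) for _ in range(8)]
x = rng.normal(size=d)

top = stacked_encoder_step(x, layer_weights)
print(top.shape)
```

Because each skip connection adds the layer’s input back in, information (and gradients) from the lower layers has a direct path to the top of the stack, which is what makes a stack this deep easier to train.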


Fig. 2c: Google’s NMT for Google Translate. Skip connections are denoted by curved arrows. *Note that the LSTM cells only show the hidden state and input; they do not show the cell state input.

The model achieves 38.95 BLEU on WMT’14 English-to-French, and 24.17 BLEU on WMT’14 English-to-German.

Intuition: GNMT - seq2seq with 8-stacked encoder (+bidirection+residual connections) + attention

8 translators sit in a column from bottom to top, starting with Translator A, B, …, H.

Every translator reads the same German text.

At every word, Translator A shares his/her findings with Translator B, who improves them and shares them with Translator C; this repeats until we reach Translator H.

Also, while reading the German text, Translator H writes down the relevant keywords based on what he knows and the information he has received.

Once everyone is done reading this German text, Translator A is told to translate the first word.

First, he tries to recall, then he shares his answer with Translator B, who improves the answer and shares it with Translator C; this repeats until we reach Translator H.

Translator H then writes the first translation word, based on the keywords he wrote and the answers he got.

Repeat this until we get the translation out.


Summary

Here’s a quick summary of all the architectures that you have seen in this article:

seq2seq
seq2seq + attention
seq2seq with bidirectional encoder + attention
seq2seq with 2-stacked encoder + attention
GNMT - seq2seq with 8-stacked encoder (+bidirection+residual connections) + attention

That’s it for now! In my next post, I will walk you through the concept of self-attention and how it has been used in Google’s Transformer and Self-Attention Generative Adversarial Network (SAGAN).

Keep an eye on this space!

Appendix: Score Functions

Below are some of the score functions as compiled by Lilian Weng.

Additive/concat and dot product have been mentioned in this article.

The idea behind score functions involving the dot product operation (dot product, cosine similarity etc.) is to measure the similarity between two vectors.

For feed-forward neural network score functions, the idea is to let the model learn the alignment weights together with the translation.
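The similarity idea is easy to check by hand. In this small sketch (vectors and function names are my own toy choices), the dot product and cosine similarity are both large when two vectors point the same way, zero when they are orthogonal, and negative when they point in opposite directions.

```python
import numpy as np

def dot_score(a, b):
    # Grows with both the alignment and the lengths of the vectors.
    return float(a @ b)

def cosine_score(a, b):
    # Dot product of the normalised vectors: depends on direction only.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])
orthogonal = np.array([0.0, 3.0])
opposite = np.array([-1.0, 0.0])

for name, vec in [("same", same_direction),
                  ("orthogonal", orthogonal),
                  ("opposite", opposite)]:
    print(name, dot_score(query, vec), cosine_score(query, vec))
```

The only difference between the two is scale: the dot product rewards longer vectors, while cosine similarity always stays between -1 and 1.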


Fig. A0: Summary of score functions

Fig. A1: Summary of score functions.

References

[1] Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)

[2] Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)

[3] Attention Is All You Need (Vaswani et al., 2017)

[4] Self-Attention GAN (Zhang et al., 2018)

[5] Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)

[6] TensorFlow’s seq2seq Tutorial with Attention (Tutorial on seq2seq+attention)

[7] Lilian Weng’s Blog on Attention (Great start to attention)

[8] Jay Alammar’s Blog on Seq2Seq with Attention (Great illustrations and worked example on seq2seq+attention)

[9] Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Wu et al., 2016)

Related Articles

Animated RNN, LSTM and GRU

Line-by-Line Word2Vec Implementation (on word embeddings)

Special thanks to Derek, William Tjhi, Yu Xuan, Ren Jie, Chris, and Serene for ideas, suggestions and corrections to this article.

Follow me on Twitter or LinkedIn for digested articles and demos on AI and Deep Learning.

