A Comprehensive Guide to Attention Mechanism in Deep Learning for Everyone

In this section, we will discuss how a simple Attention model can be implemented in Keras. The purpose of this demo is to show how a simple Attention layer can be written in Python.

As an illustration, we have run this demo on a simple sentence-level sentiment analysis dataset collected from the University of California Irvine Machine Learning Repository.

You can select any other dataset if you prefer and can implement a custom Attention layer to see a more prominent result.

Here, there are only two sentiment categories – ‘0’ means negative sentiment, and ‘1’ means positive sentiment.

You’ll notice that the dataset has three files.

Among them, two files have sentence-level sentiments and the third one has paragraph-level sentiments.

We are using the sentence-level data files (amazon_cells_labelled.txt, yelp_labelled.txt) for simplicity.

We have read and merged the two data files.
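As a minimal sketch of that step (assuming the files sit in the working directory and use the repository's tab-separated sentence/label format; the variable names corpus and labels are ours and are reused below):

import pandas as pd

# Each file has one sentence and its 0/1 sentiment label per line, tab-separated
amazon = pd.read_csv("amazon_cells_labelled.txt", sep="\t", header=None,
                     names=["sentence", "label"])
yelp = pd.read_csv("yelp_labelled.txt", sep="\t", header=None,
                   names=["sentence", "label"])

data = pd.concat([amazon, yelp], ignore_index=True)
corpus = data["sentence"].values
labels = data["label"].values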

This is what our data looks like:

We then pre-process the data to fit the model using Keras’ Tokenizer() class:

t = Tokenizer()
t.fit_on_texts(corpus)
text_matrix = t.texts_to_sequences(corpus)

The texts_to_sequences() method takes the corpus and converts it to sequences, i.e., each sentence becomes one vector. The elements of the vectors are the unique integers corresponding to each unique word in the vocabulary:

len_mat = []
for i in range(len(text_matrix)):
    len_mat.append(len(text_matrix[i]))

We must identify the maximum length of the vector corresponding to a sentence because typically sentences are of different lengths.

We should make them equal by zero padding.

We have used a ‘post padding’ technique here, i.e., zeros will be added at the end of the vectors:

from keras.preprocessing.sequence import pad_sequences
text_pad = pad_sequences(text_matrix, maxlen=32, padding='post')

Next, let’s define the basic LSTM-based model:

inputs1 = Input(shape=(features,))
x1 = Embedding(input_dim=vocab_length+1, output_dim=32,
               input_length=features,
               embeddings_regularizer=keras.regularizers.l2(0.001))(inputs1)
x1 = LSTM(100, dropout=0.3, recurrent_dropout=0.2)(x1)
outputs1 = Dense(1, activation='sigmoid')(x1)
model1 = Model(inputs1, outputs1)

Here, we have used an Embedding layer followed by an LSTM layer.

The embedding layer takes the 32-dimensional vectors, each of which corresponds to a sentence, and subsequently outputs (32, 32)-dimensional matrices, i.e., it creates a 32-dimensional vector corresponding to each word.

This embedding is also learnt during model training.

Then we add an LSTM layer with 100 neurons.

As it is a simple encoder-decoder model, we don’t want each hidden state of the encoder LSTM.

We just want to have the last hidden state of the encoder LSTM and we can do it by setting ‘return_sequences’= False in the Keras LSTM function.

In Keras, the default value of this parameter is already False.

So, no action is required.
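To make the shapes concrete, here is a small standalone sketch (an illustration, not part of the model above) contrasting the two settings:

from keras.layers import Input, LSTM

x = Input(shape=(32, 32))                        # (timesteps, embedding_dim)
last_state = LSTM(100)(x)                        # shape: (None, 100)
all_states = LSTM(100, return_sequences=True)(x) # shape: (None, 32, 100)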

The output now becomes a 100-dimensional vector, i.e., the hidden states of the LSTM are 100-dimensional.

This is passed to a feedforward or Dense layer with ‘sigmoid’ activation.

The model is trained using the Adam optimizer with binary cross-entropy loss.

The training for 10 epochs along with the model structure is shown below:

model1.summary()
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
# 'labels' holds the 0/1 sentiments read from the data files earlier
model1.fit(text_pad, labels, epochs=10, validation_split=0.2)

The validation accuracy reaches up to 77% with the basic LSTM-based model.

Let’s now implement a simple Bahdanau Attention layer in Keras and add it to the LSTM layer.

To implement this, we will use the default Layer class in Keras.

We will define a class named Attention as a derived class of the Layer class.

We need to define four functions as per the Keras custom layer generation rule.

These are build(), call(), compute_output_shape(), and get_config().

Inside build(), we will define our weights and biases, i.e., Wa and B as discussed previously. If the previous LSTM layer’s output shape is (None, 32, 100), then our weight should be (100, 1)-dimensional and our bias should be (32, 1)-dimensional.

def build(self, input_shape):
    self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                             initializer="normal")
    self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                             initializer="zeros")
    super(attention, self).build(input_shape)

Inside call(), we will write the main logic of Attention.

We simply need to create a Multi-Layer Perceptron (MLP).

Therefore, we will take the dot product of weights and inputs followed by the addition of bias terms.

After that, we apply a ‘tanh’ followed by a softmax layer.

This softmax gives the alignment scores.

Its dimension will be the number of hidden states in the LSTM, i.e., 32 in this case.

Taking its dot product along with the hidden states will provide the context vector:

def call(self, x):
    et = K.squeeze(K.tanh(K.dot(x, self.W) + self.b), axis=-1)
    at = K.softmax(et)
    at = K.expand_dims(at, axis=-1)
    output = x * at
    return K.sum(output, axis=1)

The above function returns the context vector.

The complete custom Attention class looks like this:

from keras.layers import Layer
import keras.backend as K

class attention(Layer):
    def __init__(self, **kwargs):
        super(attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                                 initializer="normal")
        self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                                 initializer="zeros")
        super(attention, self).build(input_shape)

    def call(self, x):
        et = K.squeeze(K.tanh(K.dot(x, self.W) + self.b), axis=-1)
        at = K.softmax(et)
        at = K.expand_dims(at, axis=-1)
        output = x * at
        return K.sum(output, axis=1)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

    def get_config(self):
        return super(attention, self).get_config()

The get_config() method returns the layer’s configuration so that Keras can serialize and later reload a model containing this layer.
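Because get_config() is implemented, a saved model with this layer can be reloaded; a small sketch (the file name is hypothetical):

from keras.models import load_model

# The custom class must be supplied explicitly when reloading a saved model
model = load_model("attention_model.h5", custom_objects={"attention": attention})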

Now, let’s try to add this custom Attention layer to our previously defined model.

Except for the custom Attention layer, every other layer and their parameters remain the same.

Remember, here we should set return_sequences=True in our LSTM layer because we want our LSTM to output all the hidden states.

inputs = Input((features,))
x = Embedding(input_dim=vocab_length+1, output_dim=32, input_length=features,
              embeddings_regularizer=keras.regularizers.l2(0.001))(inputs)
att_in = LSTM(no_of_neurons, return_sequences=True, dropout=0.3,
              recurrent_dropout=0.2)(x)
att_out = attention()(att_in)
outputs = Dense(1, activation='sigmoid', trainable=True)(att_out)
model = Model(inputs, outputs)
model.summary()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(text_pad, labels, epochs=10, validation_split=0.2)  # same training call as before

There is indeed an improvement in the performance as compared to the previous model. The validation accuracy now reaches up to 81.25% after the addition of the custom Attention layer.

With further pre-processing and a grid search of the parameters, we can definitely improve this further.

Different researchers have tried different techniques for score calculation.

There are different variants of the Attention model according to how the score and the context vector are calculated.

There are other variants also, which we will discuss next.

Global vs. Local Attention

So far, we have discussed the most basic Attention mechanism where all the inputs have been given some importance.

Let’s take things a bit deeper now.

The term “global” Attention is appropriate because all the inputs are given importance.

Originally, the Global Attention (defined by Luong et al 2015) had a few subtle differences from the Attention concept we discussed previously.

The difference is that it considers all the hidden states of both the encoder LSTM and the decoder LSTM to calculate a “variable-length context vector ct”, whereas Bahdanau et al. used the previous hidden state of the unidirectional decoder LSTM and all the hidden states of the encoder LSTM to calculate the context vector.

In encoder-decoder architectures, the score generally is a function of the encoder and the decoder hidden states.

Any function is valid as long as it captures the relative importance of the input words with respect to the output word.

When a “global” Attention layer is applied, a lot of computation is incurred.

This is because all the hidden states must be taken into consideration, concatenated into a matrix, and multiplied with a weight matrix of correct dimensions to get the final layer of the feedforward connection.

So, as the input size increases, the matrix size also increases.

In simple terms, the number of nodes in the feedforward connection increases and in effect it increases computation.

Can we reduce this in any way? Yes! Local Attention is the answer.

Intuitively, when we try to infer something from any given information, our mind tends to intelligently reduce the search space further and further by taking only the most relevant inputs.

The idea of Global and Local Attention was inspired by the concepts of Soft and Hard Attention used mainly in computer vision tasks.

Soft Attention is the global Attention where all image patches are given some weight; but in hard Attention, only one image patch is considered at a time.

But local Attention is not the same as the hard Attention used in the image captioning task.

On the contrary, it is a blend of both the concepts, where instead of considering all the encoded inputs, only a part is considered for the context vector generation.

This not only avoids expensive computation incurred in soft Attention but is also easier to train than hard Attention.

How can this be achieved in the first place? Here, the model tries to predict a position pt in the sequence of the embeddings of the input words.

Around the position pt, it considers a window of size, say, 2D. Therefore, the context vector is generated as a weighted average of the inputs over the positions [pt – D, pt + D], where D is empirically chosen.

Furthermore, there can be two types of alignments:

Monotonic alignment, where pt is set to t, assuming that at time t, only the information in the neighborhood of t matters.

Predictive alignment, where the model itself predicts the alignment position as follows:

pt = S · sigmoid(Vpᵀ · tanh(Wp · ht))

where ‘Vp’ and ‘Wp’ are the model parameters that are learned during training and ‘S’ is the source sentence length. Clearly, pt ∈ [0, S].
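A rough NumPy sketch of predictive alignment (names follow the text above; the Gaussian weighting that favours positions near pt is also from Luong et al.):

import numpy as np

def predict_position(h_t, W_p, v_p, S):
    # p_t = S * sigmoid(v_p^T tanh(W_p h_t)); p_t lies in [0, S]
    return S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))

def local_alignment(scores, p_t, D, positions):
    # Softmax over the window scores, then weight positions near p_t higher
    e = np.exp(scores - scores.max())
    a = e / e.sum()
    return a * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2.0) ** 2))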

The figures below demonstrate the difference between the Global and Local Attention mechanisms. Global Attention considers all hidden states (blue) whereas Local Attention considers only a subset:

Transformers – Attention is All You Need

The paper named “Attention is All You Need” by Vaswani et al is one of the most important contributions to Attention so far.

They have redefined Attention by providing a very generic and broad definition of Attention based on key, query, and values.

They have also introduced a concept called multi-headed Attention.

Let’s discuss this briefly.

First, let’s define what “self-Attention” is.

Cheng et al, in their paper named “Long Short-Term Memory-Networks for Machine Reading”, defined self-Attention as the mechanism of relating different positions of a single sequence or sentence in order to gain a more vivid representation.

A machine reader is an algorithm that can automatically understand the text given to it.

  We have taken the below picture from the paper.

The red words are read or processed at the current instant, and the blue words are the memories.

The different shades represent the degree of memory activation.

As we read or process the sentence word by word, the shades indicate which previously seen words are emphasized at each step; this is exactly what self-Attention in a machine reader does.

Previously, to calculate the Attention for a word in the sentence, the mechanism of score calculation was to either use a dot product or some other function of the word with the hidden state representations of the previously seen words.

In this paper, a fundamentally similar but more generic concept has been proposed.

Let’s say we want to calculate the Attention for the word “chasing”.

The mechanism would be to take a dot product of the embedding of “chasing” with the embedding of each of the previously seen words like “The”, “FBI”, and “is”.

Now, according to the generalized definition, each embedding of the word should have three different vectors corresponding to it, namely Key, Query, and Value.

We can easily derive these vectors using matrix multiplications.

Whenever we are required to calculate the Attention of a target word with respect to the input embeddings, we should use the Query of the target and the Key of the input to calculate a matching score, and these matching scores then act as the weights of the Value vectors during summation.

Now, you might ask what these Key, Query and Value vectors are.

These are basically abstractions of the embedding vectors in different subspaces.

Think of it in this way: you raise a query; the query hits the key of the input vector.

The Key can be compared with the memory location being read from, and the Value is the content read from that location.

Simple, right? If the dimension of the embeddings is (D, 1) and we want a Key vector of dimension (D/3, 1), we must multiply the embedding by a matrix Wk of dimension (D/3, D).

So, the key vector becomes K=Wk*E.

Similarly, for Query and Value vectors, the equations will be Q=Wq*E, V=Wv*E (E is the embedding vector of any word).
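A quick NumPy sketch of these three projections (all sizes are illustrative; K_vec is named to avoid clashing with the Keras backend alias K used earlier):

import numpy as np

D = 6                      # embedding dimension (illustrative)
E = np.random.randn(D)     # embedding of one word
Wk = np.random.randn(D // 3, D)
Wq = np.random.randn(D // 3, D)
Wv = np.random.randn(D // 3, D)

K_vec = Wk @ E             # Key:   K = Wk * E, shape (D/3,)
Q_vec = Wq @ E             # Query: Q = Wq * E
V_vec = Wv @ E             # Value: V = Wv * E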

Now, to calculate the Attention for the word “chasing”, we need to take the dot product of the query vector of the embedding of “chasing” with the key vector of each of the previous words, i.e., the key vectors corresponding to the words “The”, “FBI” and “is”. Then these values are divided by √D (the square root of the dimension of the key vectors, as in the original paper) followed by a softmax operation.

So, the operations are respectively:

softmax(Q”chasing” · K”The” / √D)
softmax(Q”chasing” · K”FBI” / √D)
softmax(Q”chasing” · K”is” / √D)

Basically, this is a function f(Qtarget, Kinput) of the query vector of the target word and the key vector of the input embeddings.

It doesn’t necessarily have to be a dot product of Q and K.

Any suitable function can be chosen.

Next, let’s say the vector thus obtained is [0.2, 0.5, 0.3]. These values are the “alignment scores” for the calculation of Attention.

These alignment scores are multiplied with the value vector of each of the input embeddings, and these weighted value vectors are added to get the context vector:

C”chasing” = 0.2 · V”The” + 0.5 · V”FBI” + 0.3 · V”is”

Practically, all the embedded input vectors are combined in a single matrix X, which is multiplied with common weight matrices Wk, Wq, Wv to get K, Q and V matrices respectively.

Now the compact equation becomes:

Z = softmax(Q · Kᵀ / √D) · V

Therefore, the context vector is a function F(K, Q, V) of Key, Query and Value.
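Here is a minimal NumPy sketch of this compact equation (using the √D scaling of the original paper; the weight matrices are illustrative parameters):

import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    # X: (seq_len, D) matrix of embedded inputs
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled matching scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # Z = softmax(QK^T/sqrt(d)) V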

The Bahdanau Attention and all other previous works on Attention are special cases of the Attention mechanism described in this work. The key highlight is that a single embedded vector is used to derive the Key, Query and Value vectors simultaneously.

In multi-headed Attention, matrix X is multiplied by different Wk, Wq and Wv matrices to get different K, Q and V matrices respectively.

And we end up with different Z matrices, i.e., the embedding of each input word is projected into different “representation subspaces”.

In, say, 3-headed self-Attention, corresponding to the “chasing” word, there will be 3 different Z matrices also called “Attention Heads”.

These Attention heads are concatenated and multiplied with a single weight matrix to get a single Attention head that will capture the information from all the Attention heads.
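Continuing the sketch above, multi-headed Attention repeats the same computation per head and concatenates the results (Wo is the final projection matrix; all names are illustrative):

import numpy as np

def multi_head_attention(X, heads, Wo):
    # heads: a list of (Wq, Wk, Wv) triples, one per Attention head
    Z = [scaled_dot_product_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(Z, axis=-1) @ Wo  # concatenated heads -> single head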

The picture below depicts the multi-head Attention.

You can see that there are multiple Attention heads arising from different V, K, Q vectors, and they are concatenated:

The actual transformer architecture is a bit more complicated.

You can read it in much more detail here.

The image above is the transformer architecture.

We see that something called ‘positional encoding’ has been used and added to the embedding of the inputs in both the encoder and decoder.

The models that we have described so far had no way to account for the order of the input words.

They have tried to capture this through positional encoding.

This mechanism adds a vector to each input embedding, and all these vectors follow a pattern that helps to determine the position of each word, or the distances between different words in the input.
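A sketch of the sinusoidal positional encoding used in the paper (a (seq_len, d_model) matrix added elementwise to the embeddings):

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal pattern from the paper: even indices use sin, odd use cos
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))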

As shown in the figure, on top of this positional encoding + input embedding layer, there are two sublayers:

In the first sublayer, there is a multi-head self-Attention layer. There is an additive residual connection from the output of the positional encoding to the output of the multi-head self-Attention, on top of which a layer normalization layer is applied. Layer normalization is a technique (Ba et al., 2016) similar to batch normalization where, instead of considering the whole minibatch of data for calculating the normalization statistics, all the hidden units in the same layer of the network are considered in the calculations. This overcomes the drawback of estimating the statistics for the summed input to any neuron over a minibatch of training samples, and thus it is convenient to use in RNNs/LSTMs (see the sketch after this list).

In the second sublayer, instead of the multi-head self-Attention, there is a feedforward layer (as shown), and all other connections are the same.

On the decoder side, apart from the two layers described above, there is another layer that applies multi-head Attention on top of the encoder stack. Then, after a sublayer followed by one linear and one softmax layer, we get the output probabilities from the decoder.
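As promised in the list above, a minimal sketch of layer normalization over the hidden units of one layer (gain and bias are the learned parameters):

import numpy as np

def layer_norm(h, gain, bias, eps=1e-6):
    # Normalize across the hidden units of one layer, not across the batch
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    return gain * (h - mu) / (sigma + eps) + bias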

Attention Mechanism in Computer Vision

You can intuitively understand where the Attention mechanism can be applied in the NLP space.

We want to explore beyond that.

So in this section, let’s discuss the Attention mechanism in the context of computer vision.

We will reference a few key ideas here and you can explore more in the papers we have referenced.

Image Captioning – Show, Attend and Tell (Xu et al, 2015)

In image captioning, a convolutional neural network is used to extract feature vectors known as annotation vectors from the image.

This produces L D-dimensional feature vectors, each of which is a representation corresponding to a part of the image.

In this work, features have been extracted from a lower convolutional layer of the CNN model so that a correspondence between the extracted feature vectors and the portions of the image can be determined.

On top of this, an Attention mechanism is applied to selectively give more importance to some of the locations of the image compared to others, for generating caption(s) corresponding to the image.

A slightly modified version of Bahdanau Attention has been used here.

Instead of taking a weighted sum of the annotation vectors (similar to hidden states explained earlier), a function has been designed that takes both the set of annotation vectors and the alignment vector, and outputs a context vector instead of simply creating a dot product (mentioned above).
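For contrast, the plain weighted-sum (soft) context that the paper’s function generalizes looks like this sketch:

import numpy as np

def soft_attention_context(annotations, alignment):
    # annotations: (L, D) feature vectors; alignment: (L,) softmax weights
    return alignment @ annotations  # (D,) expected annotation vector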

Image Generation – DRAW (Deep Recurrent Attentive Writer)

Although this work by Google DeepMind is not directly related to Attention, this mechanism has been ingeniously used to mimic the way an artist draws a picture.

This is done by drawing parts of the image sequentially.

Let’s discuss this paper briefly to get an idea about how this mechanism alone or combined with other algorithms can be used intelligently for many interesting tasks.

The main idea behind this work is to use a variational autoencoder for image generation.

Unlike a simple autoencoder, a variational autoencoder does not generate the latent representation of the data directly.

Instead, it generates multiple Gaussian distributions (say N number of Gaussian distributions) with different means and standard deviations.

From these N number of Gaussian distributions, an N element latent vector is sampled, and this sample is fed to the decoder for the output image generation.
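A small sketch of that sampling step (the reparameterization commonly used in variational autoencoders; names are illustrative):

import numpy as np

def sample_latent(mu, sigma):
    # Draw an N-element latent vector, one sample per Gaussian distribution
    eps = np.random.randn(*mu.shape)
    return mu + sigma * eps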

Note that Attention-based LSTMs have been used here for both encoder and decoder of the variational autoencoder framework.

But what is that? The main intuition behind this is to iteratively construct an image.

At every time step, the encoder passes one new latent vector to the decoder and the decoder improves the generated image in a cumulative fashion, i.e., the image generated at a certain time step gets enhanced in the next time step.

It is like mimicking an artist’s act of drawing an image step by step.

But the artist does not work on the entire picture at the same time, right? They do it in parts – if they are drawing a portrait, they do not draw the ear, eyes and other parts of the face all at once. They finish drawing the eye and then move on to another part.

If we use a simple LSTM, it will not be possible to focus on a certain part of an image at a certain time step.

Here is how Attention becomes relevant.

At both the encoder and decoder LSTM, one Attention layer (named “Attention gate”) has been used.

So, while encoding or “reading” the image, only one part of the image gets focused on at each time step.

And similarly, while writing, only a certain part of the image gets generated at that time-step.

The below image has been taken from the referenced paper.

It shows how DRAW generates MNIST images in a step-by-step process:

End Notes

This was quite a comprehensive look at the popular Attention mechanism and how it applies to deep learning.

I’m sure you must have gathered why this has made quite a dent in the deep learning space.

It is extraordinarily effective and has already penetrated multiple domains.

This Attention mechanism has uses beyond what we mentioned in this article.

If you have used it in your role or any project, we would love to hear from you.

Let us know in the comments section below and we’ll connect!

About the Authors

Prodip Hore – Research Director of the Machine Learning & AI team, American Express

Prodip received his M.S. and a Ph.D. degree in Computer Science from the University of South Florida, Tampa, and has a B.Tech in Computer Science from IEM Salt Lake, Kolkata.

Prior to joining Amex, he was a Lead Scientist at FICO, San Diego.

Currently, he is the Research Director of the Machine Learning & AI team at American Express, Gurgaon.

  Prodip has authored a number of conference papers, patents and a book chapter, and his publications have appeared in many reputed journals and featured at several conferences, including the ‘Pattern Recognition Journal’, ‘Journal of Signal Processing Systems’, ‘IEEE International Conference on Systems, Man, and Cybernetics’, ‘IEEE International Conference on Fuzzy Systems’, and ‘North American Fuzzy Information Processing Society’.

His interests include machine learning, image processing, boosting, deep learning and neural networks, natural language processing, and online and streaming algorithms.

Sayan Chatterjee – Research Engineer, American Express ML & AI Team

Sayan Chatterjee completed his B.E. in Electrical Engineering and M.Tech in Computer Science from Jadavpur University and the Indian Statistical Institute, Kolkata, respectively.

He is currently working as a Research Engineer on the American Express ML & AI Team, Gurgaon.

Before joining American Express, he worked at PwC India as an Associate in the Data & Analytics practice.

His research interests are in deep learning, statistical learning, computer vision, natural language processing, etc.
