Attention in Neural Networks

Let’s look at another example, “Post photos in your Dropbox folder to Instagram”.

Compared to the previous one, here “Instagram” is the most relevant for action and “Dropbox” is the trigger.

The same word can be either the trigger or the action.

So determining what role a word plays requires us to investigate how prepositions like “to” are used in such sentences.

The paper introduces a “Latent Attention” model to do this.

Fig 4: “Latent Attention” presented by Chen et al. in this paper

A “J” dimensional “Latent attention” vector is prepared: each dimension here represents a word, and the softmax gives a sense of relative importance across the words in the vector.

Input sequence is of length “J” (i.e. “J” words).

Each word is represented by a “d” dimensional embedding vector.

The entire sequence is therefore a d x J matrix.

The product of this matrix with a trainable vector “u” of length “d” is computed, with a softmax over it.

This gives the “Latent attention” vector of length “J”.

Next, “Active Attention” is prepared similar to above, but instead of using a “d” dimensional vector like “u”, a “d x J” dimensional trainable matrix V is used, resulting in a “J x J” Active attention matrix.

Column-wise softmax is done between the dimensions of each word.

The “Active Weights” are then computed as the product of these two.

Another set of word embeddings is then weighted by these “Active Weights” to derive the output, which is then softmaxed to arrive at the predictions.
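To make the shapes concrete, here is a minimal NumPy sketch of the steps described above. It is my reading of the description, not the authors' code: the random weights stand in for trained parameters, and the projection matrix P (mapping the weighted embedding to class scores before the final softmax) and the sizes d, J and n_classes are assumptions added for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, J, n_classes = 4, 6, 3            # illustrative sizes (assumed)

E = rng.normal(size=(d, J))          # input embeddings, one column per word
u = rng.normal(size=d)               # trainable vector "u" (random stand-in)
V = rng.normal(size=(d, J))          # trainable matrix "V" (random stand-in)
E_out = rng.normal(size=(d, J))      # the second set of word embeddings
P = rng.normal(size=(n_classes, d))  # hypothetical output projection

latent = softmax(u @ E)              # (J,) relative importance of each word
active = softmax(E.T @ V, axis=0)    # (J, J), column-wise softmax
active_weights = active @ latent     # (J,) product of the two
output = E_out @ active_weights      # (d,) weighted sum of embeddings
prediction = softmax(P @ output)     # (n_classes,) class probabilities
```

Because every column of the active attention matrix and the latent vector each sum to 1, the active weights also sum to 1, so the output is a convex combination of the second set of embeddings.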

To me, the derivation of the active weights as a product of vectors representing each word in the input and the latent attention vector that represents the importance across the words is a form of “self attention”, but more on self attention later.

Attention Based Convolutional Neural Network

In this paper Yin et al presented ABCNN (Attention Based CNN) to model a pair of sentences, used in answer selection, paraphrase identification and textual entailment tasks.

The key highlight of the proposed attention based model was that it considers the impact/relationship/influence that exists between the different parts or words or whole of one input sentence with the other, and provides an interdependent sentence pair representation that can be used in subsequent tasks.

Let’s take a quick look at the base network first before looking at how attention was introduced into it.

Fig 5: Yin et al. in this paper

Input Layer: Starting with two sentences s0 and s1 having 5 and 7 words respectively.

Each word is represented by an embedding vector.

If you are counting the boxes, then Fig 5 says the embedding vector is of length 8.

So s0 is an 8 x 5 rank 2 tensor, s1 is an 8 x 7 rank 2 tensor.

Convolution Layer(s): There could be one or more convolution layers.

The output of the previous conv layer is the input to the current conv layer.

This is referred to as the “representation feature map”.

For the first conv layer, this will be the matrix representing the input sentence.

The convolution layer applies a filter of width 3.

This means the convolution operation is performed on s0 which has 5 words 7 times (xx1, x12, 123, 234, 345, 45x, 5xx), creating a feature map with 7 columns.

For s1, this becomes a feature map with 9 columns.

The convolution operation performed in each step is “tanh (W.c + b)” where “c” is the concatenated embedding of the words in each of the 7 convolution steps (xx1, x12, 123, 234, 345, 45x, 5xx).

In other words, c is a vector of length 24.

If you are counting the boxes, then according to Fig 5, W was of dimension 8 x 24.
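The column counts above are easy to check with a small NumPy sketch of this padded ("wide") convolution. This is only an illustration under my own assumptions: the function name wide_convolution is made up, the padding "x" is taken to be a zero vector, and W and b are random stand-ins for trained parameters.

```python
import numpy as np

def wide_convolution(sentence, W, b, width=3):
    """Apply tanh(W.c + b) over every window of `width` words.

    sentence: (d, n) matrix, one embedding column per word.
    Returns a feature map with n + width - 1 columns.
    """
    d, n = sentence.shape
    pad = np.zeros((d, width - 1))                  # the "x" positions
    padded = np.concatenate([pad, sentence, pad], axis=1)
    columns = []
    for i in range(n + width - 1):
        # Concatenate the embeddings in the window: length width * d.
        c = padded[:, i:i + width].T.reshape(-1)
        columns.append(np.tanh(W @ c + b))
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
s0 = rng.normal(size=(8, 5))     # 5 words, embedding length 8
s1 = rng.normal(size=(8, 7))     # 7 words, embedding length 8
W = rng.normal(size=(8, 24))     # 24 = 3 words x 8 dimensions
b = rng.normal(size=8)

conv0 = wide_convolution(s0, W, b)   # 7 columns
conv1 = wide_convolution(s1, W, b)   # 9 columns
```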

Average Pooling Layer(s): The “average pooling layer” does a column wise averaging of “w” columns, where “w” is the width of the convolution filter used in this layer.

In our example, this was 3.

So the following averages are produced for s0: 123, 234, 345, 456, 567, transforming the 7 column feature map back into 5 columns.

Similarly for s1.

Pooling in last layer: In the last convolution layer, average pooling is done not over “w” columns, but ALL columns, therefore transforming the matrix feature map into a sentence representing vector.
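Both pooling variants fit in a few lines of NumPy. This is a sketch with made-up function names and a toy 2 x 7 feature map, chosen so that the averages are easy to verify by hand.

```python
import numpy as np

def average_pool_w(feature_map, width=3):
    # Average every run of `width` adjacent columns (123, 234, ...).
    n_cols = feature_map.shape[1] - width + 1
    return np.stack(
        [feature_map[:, i:i + width].mean(axis=1) for i in range(n_cols)],
        axis=1,
    )

def average_pool_all(feature_map):
    # Average over ALL columns, giving one vector for the sentence.
    return feature_map.mean(axis=1)

# Toy feature map with 7 columns: rows are 0..6 and 7..13.
feature_map = np.arange(14, dtype=float).reshape(2, 7)

pooled = average_pool_w(feature_map)             # 7 columns -> 5 columns
sentence_vector = average_pool_all(feature_map)  # one value per row
```

For the first row (0 to 6) the windowed averages come out as 1, 2, 3, 4, 5, and the all-column average is 3.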

Output Layer: The output layer to handle the sentence representing vectors is chosen according to the task; in the figure a logistic regression layer is shown.

Note that the input to the first layer is words, next layer is short phrases (in the example above, a filter width of 3 makes it a phrase of 3 words), next layer is larger phrases and so on until the final layer where the output is a sentence representation.

In other words, with each layer, an abstract representation of lower to higher granularity is produced.

The paper presents three ways in which attention is introduced into this base model.

ABCNN-1

Fig 6: ABCNN-1 in this paper

In ABCNN-1, attention is introduced before the convolution operation.

The input representation feature maps (described under “Convolution Layer(s)” in the base model description, shown as red matrix in Fig 6) for both sentences s0 (8 x 5) and s1 (8 x 7) are “matched” to arrive at the Attention Matrix “A” (5 x 7).

Every cell in the attention matrix, Aij, represents the attention score between the ith word in s0 and jth word in s1.

In the paper this score is calculated as 1/(1 + |x − y|) where | · | is Euclidean distance.

This attention matrix is then transformed back into an “Attention Feature Map” that has the same dimensions as the input representation maps (blue matrix), i.e. 8 x 5 and 8 x 7, using trainable weight matrices W0 and W1 respectively.

Now the convolution operation is performed on not just the input representation like the base model, but on both the input representation and the attention feature map just calculated.

In other words, instead of using a rank 2 tensor as input as stated under “Input Layer” in the base model description above, the convolution operation is performed on a rank 3 tensor.
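Here is a rough NumPy sketch of the ABCNN-1 input preparation, using the 5 and 7 word shapes from the running example. The shapes I chose for W0 (8 x 7) and W1 (8 x 5) are my assumption, picked so that the attention feature maps come out as 8 x 5 and 8 x 7; both are random stand-ins for trained weights, and attention_matrix is a made-up helper name.

```python
import numpy as np

def attention_matrix(F0, F1):
    """A[i, j] = 1 / (1 + Euclidean distance between column i of F0
    and column j of F1)."""
    diff = F0[:, :, None] - F1[:, None, :]    # (d, n0, n1)
    dist = np.linalg.norm(diff, axis=0)       # (n0, n1)
    return 1.0 / (1.0 + dist)

rng = np.random.default_rng(0)
s0 = rng.normal(size=(8, 5))      # representation feature map of s0
s1 = rng.normal(size=(8, 7))      # representation feature map of s1

A = attention_matrix(s0, s1)      # (5, 7)

W0 = rng.normal(size=(8, 7))      # assumed shape, random stand-in
W1 = rng.normal(size=(8, 5))      # assumed shape, random stand-in
att0 = W0 @ A.T                   # (8, 5) attention feature map for s0
att1 = W1 @ A                     # (8, 7) attention feature map for s1

# Stack representation and attention maps into rank 3 conv inputs.
conv_input0 = np.stack([s0, att0], axis=0)    # (2, 8, 5)
conv_input1 = np.stack([s1, att1], axis=0)    # (2, 8, 7)
```

Scores fall in (0, 1], and identical word vectors get the maximum score of 1.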

ABCNN-2

Fig 7: ABCNN-2 in this paper

In ABCNN-2, the attention matrix is prepared not from the input representation feature map as described in ABCNN-1 but from the output of the convolution operation; let’s call this the “conv feature map”.

In our example this is the 7 and 9 column feature maps representing s0 and s1 respectively.

Therefore, the attention matrix dimensions will also be different compared to ABCNN-1: it’s 7 x 9 here.

This attention matrix is then used to derive attention weights by summing all attention values in a given row (for s0) or column (for s1).

For example, for 1st column in conv feature map for s0, this would be sum of all values in 1st row in attention matrix.

For 1st column in conv feature map for s1, this would be sum of all values in 1st column of attention matrix.

In other words, there is one attention weight for every unit/column in the conv feature map.

The attention weight is then used to “re-weight” the conv feature map columns.

Every column in the pooling output feature map is computed as the attention weighted sum of the “w” conv feature map columns that are being pooled — in our examples above this was 3.
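The attention-weighted pooling can be sketched as follows. Again this is my own illustration with hypothetical function names; the conv feature maps are random placeholders with the 7 and 9 column shapes from the example.

```python
import numpy as np

def attention_matrix(F0, F1):
    # Same score as before: 1 / (1 + Euclidean distance).
    diff = F0[:, :, None] - F1[:, None, :]
    return 1.0 / (1.0 + np.linalg.norm(diff, axis=0))

def attention_weighted_pool(conv_map, weights, width=3):
    # Re-weight each column, then sum every run of `width` columns.
    weighted = conv_map * weights             # broadcasts over rows
    n_cols = conv_map.shape[1] - width + 1
    return np.stack(
        [weighted[:, i:i + width].sum(axis=1) for i in range(n_cols)],
        axis=1,
    )

rng = np.random.default_rng(0)
conv0 = rng.normal(size=(8, 7))   # conv feature map for s0 (placeholder)
conv1 = rng.normal(size=(8, 9))   # conv feature map for s1 (placeholder)

A = attention_matrix(conv0, conv1)   # (7, 9)
a0 = A.sum(axis=1)                   # row sums: one weight per s0 column
a1 = A.sum(axis=0)                   # column sums: one weight per s1 column

pooled0 = attention_weighted_pool(conv0, a0)   # back to 5 columns
pooled1 = attention_weighted_pool(conv1, a1)   # back to 7 columns
```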

ABCNN-3

Fig 8: ABCNN-3 in this paper

ABCNN-3 simply combines both, essentially applying attention to both the input of convolution and to the convolution output while pooling.

Decomposable Attention Model

For natural language inference, this paper by Parikh et al first creates the attention weights matrix comparing each word in one sentence with all words of the other, normalized as shown in the image.

But after this, in the next step, the problem is “decomposed into sub-problems” that are solved separately.

A feed forward network is used to take the concatenated word embedding and corresponding normalized alignment vector to generate the “comparison vector”.

The comparison vectors for each sentence are then summed to create two aggregate comparison vectors, one representing each sentence, which are then fed through another feed forward network for final classification.

The word order doesn’t matter in this solution and only attention is used.
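A compact NumPy sketch of the attend, compare and aggregate steps is below. It is a simplification under stated assumptions: each feed forward network is reduced to a single ReLU layer (or a single linear layer for the classifier), all weights are random stand-ins, and the sizes d, h and n_classes are invented for the example.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relu_layer(x, W, b):
    # One-layer stand-in for a feed forward network.
    return np.maximum(0.0, x @ W + b)

def decomposable_attention(a, b, params):
    """a: (len_a, d) and b: (len_b, d) word embeddings, one row per word."""
    Wf, bf, Wg, bg, Wh, bh = params
    # Attend: score every word of a against every word of b.
    e = relu_layer(a, Wf, bf) @ relu_layer(b, Wf, bf).T   # (len_a, len_b)
    beta = softmax(e, axis=1) @ b      # part of b aligned to each word of a
    alpha = softmax(e, axis=0).T @ a   # part of a aligned to each word of b
    # Compare: each word together with its aligned vector.
    v1 = relu_layer(np.concatenate([a, beta], axis=1), Wg, bg)
    v2 = relu_layer(np.concatenate([b, alpha], axis=1), Wg, bg)
    # Aggregate: sum over words, then classify.
    agg = np.concatenate([v1.sum(axis=0), v2.sum(axis=0)])
    return softmax(agg @ Wh + bh, axis=0)

rng = np.random.default_rng(0)
d, h, n_classes = 6, 5, 3          # illustrative sizes (assumed)
params = (
    rng.normal(size=(d, h)), rng.normal(size=h),              # attend
    rng.normal(size=(2 * d, h)), rng.normal(size=h),          # compare
    rng.normal(size=(2 * h, n_classes)), rng.normal(size=n_classes),
)
a = rng.normal(size=(4, d))        # sentence with 4 words
b = rng.normal(size=(7, d))        # sentence with 7 words

probs = decomposable_attention(a, b, params)
shuffled = decomposable_attention(a[rng.permutation(4)], b, params)
```

Shuffling the words of a sentence leaves the prediction unchanged (up to floating point rounding), which is the "word order doesn't matter" property in action.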

Fig 9: From this paper by Parikh et al

Neural Transducer for Online Attention

For online tasks, such as real time speech recognition, where we do not have the luxury of processing through an entire sequence, this paper by Jaitly et al introduced the Neural Transducer that makes incremental predictions while processing blocks of input at a time, as opposed to encoding or generating attention over the entire input sequence.

The input sequence is divided into multiple blocks of equal length (except possibly the last block) and the Neural Transducer model computes attention only for the inputs in the current block, which is then used to generate the output corresponding to that block.

The connection with prior blocks exists only via the hidden state connections that are part of the RNN on the encoder and decoder side.

While this is similar to an extent to the local attention described earlier, there is no explicit “position alignment” as described there.
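The block-wise idea can be illustrated with a toy NumPy sketch. To be clear, this is not the Neural Transducer itself: there is no RNN here, the dot-product scoring is my choice, and the tanh update that carries the decoder state from one block to the next is a made-up placeholder for the recurrent connection. It only shows that attention for a block looks at that block's encoder states and nothing later.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def split_into_blocks(states, block_size):
    # Equal-length blocks, except possibly the last one.
    return [states[i:i + block_size]
            for i in range(0, len(states), block_size)]

def blockwise_contexts(encoder_states, decoder_state, block_size):
    contexts = []
    for block in split_into_blocks(encoder_states, block_size):
        scores = block @ decoder_state     # scores for this block only
        weights = softmax(scores)
        context = weights @ block          # context vector for the block
        contexts.append(context)
        # Toy carry-over standing in for the decoder RNN state.
        decoder_state = np.tanh(decoder_state + context)
    return contexts

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(10, 4))   # 10 time steps, size 4
query = rng.normal(size=4)                  # initial decoder state

blocks = split_into_blocks(encoder_states, 4)       # lengths 4, 4, 2
contexts = blockwise_contexts(encoder_states, query, 4)
```

Changing the encoder states after the first block has no effect on the first context vector, which is what makes incremental prediction possible.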

Fig 10: Neural Transducer attending to a limited part of the sequence. From this paper by Jaitly et al

Area Attention

Refer back to Fig 1, an illustration of the base introductory attention model we saw in the earlier post.

A generalized abstraction of alignment is that it is like querying the memory as we generate the output.

The memory is some sort of representation of the input and the query is some sort of representation of output.

In Fig 1, the memory or collection of keys was the encoder hidden states “h”, the blue nodes, and query was the current decoder hidden state “s”, the green nodes.

The derived alignment score is then multiplied with “values” — another representation of the input, the gold nodes in Fig 1.

Area attention is when attention is applied to an “area”, not necessarily just one item as in a vanilla attention model.

“Area” is defined as a group of structurally adjacent items in the memory (i.e. the input sequence in a one dimensional input like a sentence of words).

An area is formed by combining adjacent items in the memory.

In a 2-D case like an image, the area will be any rectangular subset within the image.

Fig 11: Area attention from this paper by Yang et al.

The “key” vector for an area can be defined simply as the mean vector of the key of each item in the area.

In a sequence to sequence translation task, this would be the mean of each of the hidden state vectors involved in the area.

In the definition under “Simple Key Vector” in Fig 11, “k” is the hidden state vector.

If we are defining an area containing 3 adjacent words, then the mean vector is the mean of the hidden state vectors generated after each of the three words in the encoder.

The “value” on the other hand is defined as the sum of all value vectors in the area.

In our basic example, this will again be the encoder hidden state vectors corresponding to the three words for which the area is being defined.
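For the one dimensional case, the simple key and value definitions can be sketched in NumPy like this. The function names, the max_width cap on area size, and the dot-product scoring against the query are my assumptions for the sake of a runnable example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def area_keys_values(keys, values, max_width):
    """Enumerate every run of 1..max_width adjacent items.

    keys, values: (n, d) arrays, one row per memory item.
    Area key = mean of the item keys, area value = sum of item values.
    """
    areas, area_keys, area_values = [], [], []
    n = len(keys)
    for width in range(1, max_width + 1):
        for start in range(n - width + 1):
            areas.append((start, width))
            area_keys.append(keys[start:start + width].mean(axis=0))
            area_values.append(values[start:start + width].sum(axis=0))
    return areas, np.stack(area_keys), np.stack(area_values)

def area_attention(query, keys, values, max_width):
    areas, K, V = area_keys_values(keys, values, max_width)
    weights = softmax(K @ query)    # one attention weight per area
    return weights @ V, areas, weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))    # encoder hidden states for 5 words
query = rng.normal(size=4)          # current decoder hidden state

# As in the basic example, hidden states serve as both keys and values.
context, areas, weights = area_attention(query, hidden, hidden, max_width=3)
```

With 5 words and areas of up to 3 words there are 5 + 4 + 3 = 12 areas to attend over.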

We can also define a richer representation of the key vector that takes into consideration not just the mean, but also the standard deviation and shape vector as explained in Fig 11.

Shape vector here is defined as the concatenation of height and width vectors, which in turn are created from actual width and height numbers projected as vectors using embedding matrices, which I presume are learnt with the model.

The key is derived as an output of a single layer perceptron that takes mean, std dev and shape vectors as input.

Once the key and value vectors are defined, the rest of the network could be any attention utilizing model.

If we are using an encoder-decoder RNN as seen in Fig 1, then plugging the derived area based key and value vectors in place of those in Fig 1 will make it an area based attention model.

Reading through these papers gives an interesting perspective on how researchers have used attention mechanisms for various tasks and how the thinking has evolved.

Hopefully this quick study gives a sense of how we could tweak and use one of these or a new variant in our own tasks.
