Visualizing ELMo Contextual Vectors

Contextual vectors can be useful for word sense disambiguation.

Henry Chang, Apr 15

Issue with Word Embedding

Word embedding has difficulty dealing with word senses.

No matter how many senses a word has, a word embedding method collapses all of them into a single vector representation.

This can cause a problem for downstream NLP tasks like document classification.

For example, the word bank can refer to a financial institution or to sloping land beside a body of water.

According to a word embedding demo using a model trained on Google News, the top 5 nearest words to bank in vector space are banks, banking, Bank, lender, and banker.

We can see that the embedding of bank here captures only the financial-institution sense and misses the meaning of sloping land beside water.
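As a minimal sketch of such a nearest-neighbor query, the snippet below uses gensim and assumes the pretrained Google News word2vec binary has been downloaded locally; the demo mentioned above may use a different interface, so treat the file name and exact scores as placeholders rather than the demo's actual setup.

    from gensim.models import KeyedVectors

    # Load the pretrained Google News word2vec vectors (the path is an
    # assumption; the large binary file must be downloaded separately).
    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )

    # A single static vector represents every sense of "bank", so its
    # nearest neighbors all reflect the financial-institution sense.
    for word, score in wv.most_similar("bank", topn=5):
        print(f"{word}\t{score:.3f}")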

Therefore, in document classification, the text “Sandy walks along the river bank” could be misclassified into the “Financial institution” class, because the word vector of bank points to the financial-institution sense.

Contextual Vectors

The contextual vector of a word, on the other hand, can capture its different senses.

As the name suggests, a word's contextual vector depends on its neighboring words in the sentence.

So the word bank in “I withdraw money in the bank” and “She had a nice walk along the river bank” will have very different vectors.

Recent research has shown that adding contextual vectors to various NLP tasks, such as textual entailment, named entity extraction, and question answering, significantly improves the state-of-the-art results.

These contextual vectors are the output of pre-trained language models.

Details on pre-trained language models can be found in the ELMo or BERT papers.

Below I will use the ELMo model to generate contextual vectors.

Visualizing ELMo Contextual Vectors

Let's use the ELMo model to generate contextual vectors and PCA to project them into a 2D space for visualization.

In the ELMo paper, there are 3 layers of word embeddings: layer zero is the character-based, context-independent layer, followed by two Bi-LSTM layers.

The authors have empirically shown that the word vectors from the first Bi-LSTM layer better capture syntax, while those from the second layer better capture semantics.

We will visualize both layer 1 and layer 2 contextual vectors for three words that have multiple senses: bank, work, and plant.
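As a rough sketch of how figures like the ones below can be produced (the author's actual code is linked at the end of the post), the snippet below uses AllenNLP's ElmoEmbedder, which returns one 1024-dimensional vector per token for each of the three layers, and scikit-learn's PCA. Treat the sentence list and the token-index bookkeeping as illustrative assumptions rather than the exact code behind the figures.

    from allennlp.commands.elmo import ElmoEmbedder
    from sklearn.decomposition import PCA

    elmo = ElmoEmbedder()  # downloads the default pretrained ELMo weights

    # Tokenized sentences containing the ambiguous word.
    sentences = [
        "The river bank was not clean".split(),
        "One can deposit money at the bank".split(),
    ]
    target = "bank"

    layer1_vecs, layer2_vecs = [], []
    for tokens in sentences:
        # embed_sentence returns an array of shape (3, num_tokens, 1024):
        # layer 0 (character CNN), then the two Bi-LSTM layers.
        layers = elmo.embed_sentence(tokens)
        idx = tokens.index(target)
        layer1_vecs.append(layers[1][idx])
        layer2_vecs.append(layers[2][idx])

    # Project the layer 2 vectors of "bank" onto 2 principal components.
    points_2d = PCA(n_components=2).fit_transform(layer2_vecs)
    print(points_2d)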

Bank

First, let's revisit the word bank.

I’ve picked 5 sentences that contain the word bank, where bank takes one of the two senses: a financial institution or sloping land beside water.

Here are the 5 sentences:

1. The river bank was not clean
2. One can deposit money at the bank
3. I withdrew cash from the bank
4. He had a nice walk along the river bank
5. My wife and I have a joint bank account

Below are the projection results using PCA.

Each colored point represents the word vector of bank in its context.

We can cluster the above contextual vectors into 2 groups. The word vectors of bank in the upper right cluster mean sloping land beside water, while the bottom left cluster carries the meaning of a financial institution.

Similarly, we can see there are 2 clusters in the layer 2 ELMo vector space.

By using contextual vectors, bank has different word vectors depending on its context, and occurrences with the same sense are close to each other in vector space!

Work

Work is another word that has multiple senses.

Work as a noun means something done or made, and as a verb it means to engage in activity or labor.

Here are the 5 sentences that contain either one of the two senses mentioned above:

1. I like this beautiful work by Andy Warhol
2. Employee works hard every day
3. My sister works at Starbucks
4. This amazing work was done in the early nineteenth century
5. Hundreds of people work in this building

Below are the projection results using PCA.

Each colored point represents the word vector of work in its context.

Looking at the layer 1 vectors, we cannot immediately tell one cluster apart from the other.

However, we know that the left three vectors use work as a verb, while the right two use work as a noun meaning something done or made.

For the layer 2 vectors, we observe a clear cluster of work as a verb in the lower right corner and another cluster of work as a noun in the upper left corner.

Plant

Plant can mean “a living organism” or “to place a seed in the ground.”

Here are the 5 sentences that contain either one of the two senses mentioned above:

1. The gardener planted some trees in my yard
2. I plan to plant a Joshua tree tomorrow
3. My sister planted a seed and hopes it will grow to a tree
4. This kind of plant only grows in the subtropical region
5. Most of the plants will die without water

Below are the projection results using PCA. Each colored point represents the word vector of plant in its context.

The upper right cluster contains the vectors in which plant means placing a seed in the ground.

For the lower left cluster, the word plant means a living organism.

Similar results can be found in the layer 2 figure.

The upper left 3 vectors form a cluster in which plant means “place a seed in the ground,” and the lower right 2 vectors form a cluster in which plant means a living organism.

An interesting point is that in all 3 experiments above, the clusters appear more distinct and the distance between cluster centroids is larger in the layer 2 ELMo vector space than in layer 1.

The ELMo paper mentions that using the second layer in the Word Sense Disambiguation task results in a higher F1 score than using the first layer.

Our observation of the greater inter-centroid distance in the layer 2 vector space offers a possible explanation for the paper's finding.

The code for generating the above figures can be found in this GitHub repository.

Conclusion

By visualizing the contextual vectors of words with multiple senses, we empirically showed that:

1. Contextual word vectors with the same sense form a cluster. We can then calculate the cluster centroid of each sense and use a simple 1-nearest-neighbor approach to disambiguate word sense in a sentence, as sketched below.
2. ELMo layer 2 word vectors with the same sense form clearer clusters, and the distance between cluster centroids is larger than in layer 1.
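As a minimal sketch of that centroid idea (not part of the original post's code), assume we already have layer 2 ELMo vectors for labeled example sentences of each sense; we can average them into per-sense centroids and assign a new occurrence to the nearest centroid. The sense names and toy 2D vectors below are placeholders standing in for 1024-dimensional ELMo vectors.

    import numpy as np

    def build_centroids(labeled_vectors):
        # labeled_vectors: dict mapping sense name -> list of context vectors.
        return {
            sense: np.mean(vectors, axis=0)
            for sense, vectors in labeled_vectors.items()
        }

    def disambiguate(vector, centroids):
        # Return the sense whose centroid is closest to the vector (1-NN).
        return min(
            centroids,
            key=lambda sense: np.linalg.norm(vector - centroids[sense]),
        )

    # Toy usage with made-up 2D vectors.
    centroids = build_centroids({
        "financial_institution": [np.array([5.0, 1.0]), np.array([6.0, 0.5])],
        "river_bank": [np.array([-4.0, 3.0]), np.array([-5.0, 2.5])],
    })
    print(disambiguate(np.array([5.5, 0.8]), centroids))  # financial_institution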

Any suggestions or comments on this post are more than welcome!
