What the heck is Word Embedding

Looking at text data through the lens of Neural Nets

Samarth Agrawal

Feb 10

Photo by Dmitry Ratushny on Unsplash

Word Embedding => Collective term for models that learn to map a set of words or phrases in a vocabulary to vectors of numerical values.

Neural Networks are designed to learn from numerical data.

Word embedding is really all about improving the ability of networks to learn from text data by representing that data as lower-dimensional vectors.

These vectors are called embeddings.

This technique is used to reduce the dimensionality of text data, but these models can also learn some interesting traits about the words in a vocabulary.

How it is done!

The general approach for dealing with words in your text data is to one-hot encode your text.

You will have tens of thousands of unique words in your text vocabulary.

Computations with such one-hot encoded vectors for these words will be very inefficient because most values in your one-hot vector will be 0.
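To make the sparsity concrete, here is a minimal sketch of one-hot encoding in plain Python. The toy vocabulary, the `one_hot` helper, and the vocabulary size of 50,000 are illustrative assumptions, not values from any particular dataset.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list of length vocab_size with a 1 at position index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

# Hypothetical toy vocabulary; a real one has tens of thousands of entries.
vocab = ["the", "weather", "is", "cool", "hot", "today"]
word_to_index = {word: i for i, word in enumerate(vocab)}

cool_vector = one_hot(word_to_index["cool"], len(vocab))
print(cool_vector)  # [0, 0, 0, 1, 0, 0]

# With a realistic vocabulary size, almost every entry is zero.
large = one_hot(512, 50_000)
zero_fraction = large.count(0) / len(large)
print(zero_fraction)  # 0.99998
```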

So, in the matrix multiplication between a one-hot vector and the first hidden layer, almost every product involves a 0 and contributes nothing to the output.

We use embeddings to solve this problem and greatly improve the efficiency of our network.

Embeddings are just like a fully-connected layer.

We call this layer the embedding layer and its weights the embedding weights.

Now, instead of doing the matrix multiplication between the inputs and the hidden layer, we directly grab the values from the embedding weight matrix.

We can do this because multiplying a one-hot vector by a weight matrix returns the row of the matrix corresponding to the index of the ‘1’ input unit.

So, we use this weight matrix as a lookup table.
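The claim that the multiplication and the row lookup give the same result can be checked with a small sketch. The 4 x 3 weight matrix and the `matvec_one_hot` helper are made up for illustration; a real embedding matrix would be far larger and learned during training.

```python
def matvec_one_hot(one_hot_vector, weights):
    """Multiply a one-hot row vector (length V) by a V x D weight matrix."""
    dim = len(weights[0])
    return [
        sum(one_hot_vector[i] * weights[i][j] for i in range(len(weights)))
        for j in range(dim)
    ]

# Hypothetical 4-word vocabulary with 3-dimensional embeddings.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

index = 2
one_hot_vector = [1 if i == index else 0 for i in range(len(weights))]

via_matmul = matvec_one_hot(one_hot_vector, weights)  # V * D multiplications
via_lookup = weights[index]                           # a single row read

print(via_matmul == via_lookup)  # True
```

The multiplication touches every entry of the matrix, while the lookup reads one row, which is where the efficiency gain comes from.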

We encode the words as integers, for example ‘cool’ is encoded as 512, ‘hot’ is encoded as 764.

Then, to get the hidden layer output for ‘cool’, we simply look up row 512 of the weight matrix.

This process is called Embedding Lookup.
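The whole path from words to vectors can be sketched in a few lines. The sentence, the `build_vocab` and `embedding_lookup` helpers, and the embedding dimension of 4 are assumptions chosen for illustration; the weights are random here, whereas a trained network would have adjusted them.

```python
import random

def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    word_to_id = {}
    for token in tokens:
        if token not in word_to_id:
            word_to_id[token] = len(word_to_id)
    return word_to_id

def embedding_lookup(weights, ids):
    """Return one row of the weight matrix per integer id."""
    return [weights[i] for i in ids]

tokens = "the coffee is hot but the room is cool".split()
word_to_id = build_vocab(tokens)

embedding_dim = 4  # hypothetical value for this sketch
rng = random.Random(0)
# Randomly initialised weights; training would adjust these rows.
weights = [
    [rng.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(len(word_to_id))
]

ids = [word_to_id[t] for t in tokens]
vectors = embedding_lookup(weights, ids)

print(ids)                            # [0, 1, 2, 3, 4, 0, 5, 2, 6]
print(len(vectors), len(vectors[0]))  # 9 4
```

Repeated words such as "the" map to the same integer and therefore to the same row of the matrix.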

The number of dimensions in the hidden layer output is the embedding dimension.

To reiterate:
a) The embedding layer is just a hidden layer
b) The lookup table is just an embedding weight matrix
c) The lookup is just a shortcut for matrix multiplication
d) The lookup table is trained just like any weight matrix

Popular off-the-shelf word embedding models in use today:
Word2Vec (by Google)
GloVe (by Stanford)
fastText (by Facebook)

Word2Vec:
This model is provided by Google and is trained on Google News data.

The pretrained model has 300-dimensional vectors for a vocabulary of 3 million words and phrases, learned from roughly 100 billion words of Google News text.

The team used skip-gram with negative sampling to build this model.

It was released in 2013.
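As a rough illustration of what skip-gram with negative sampling trains on, the sketch below builds (center, context) pairs from a sentence and draws a few "negative" words for one pair. The helper names and the window size are assumptions, and the negatives are drawn uniformly to keep the example short; Word2Vec itself samples them from a smoothed word-frequency distribution.

```python
import random

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def sample_negatives(vocab, exclude, k, rng):
    """Draw k words uniformly from the vocabulary, skipping excluded words."""
    candidates = [w for w in vocab if w not in exclude]
    return [rng.choice(candidates) for _ in range(k)]

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]

rng = random.Random(0)
vocab = sorted(set(tokens))
center, context = pairs[2]
# Each negative is a word that is neither the center nor the true context.
negatives = sample_negatives(vocab, exclude={center, context}, k=2, rng=rng)
```

The model is then trained to score the true pair highly and the negative pairs low, which avoids computing an output over the whole vocabulary.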

GloVe:
Global Vectors for Word Representation (GloVe) is provided by Stanford.

They provide various models with 25, 50, 100, 200, and 300 dimensions, trained on corpora of 6, 27, 42, and 840 billion tokens (the 27-billion-token corpus consists of 2 billion tweets).

The team used word-to-word co-occurrence to build this model.

In other words, if two words co-occur many times, it means they have some linguistic or semantic similarity.
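The co-occurrence idea can be sketched as a simple counting pass over a token list. The `cooccurrence_counts` helper, the sentence, and the window of 1 are assumptions for illustration; this counts every neighbour equally, whereas GloVe itself gives less weight to more distant word pairs and then fits vectors to the resulting statistics.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "ice is cold and steam is hot and ice is solid".split()
counts = cooccurrence_counts(tokens, window=1)

print(counts[("ice", "is")])    # 2
print(counts[("steam", "is")])  # 1
print(counts[("ice", "hot")])   # 0
```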

fastText:
This model is developed by Facebook.

They provide 3 models with 300 dimensions each.

fastText achieves good performance for word representations and sentence classification because it makes use of character-level representations.

Each word is represented as a bag of character n-grams in addition to the word itself.

For example, for the word partial, with n=3, the fastText representation for the character n-grams is <pa, par, art, rti, tia, ial, al>.

The symbols < and > are added as boundary markers at the start and end of the word, so that an n-gram can be told apart from the word itself.

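The character n-gram extraction described for fastText can be sketched as follows. The `char_ngrams` helper is a hypothetical name, and the sketch uses a single n-gram length for clarity, whereas fastText combines several lengths.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("partial"))
# ['<pa', 'par', 'art', 'rti', 'tia', 'ial', 'al>']

# The bag for a word is its n-grams plus the whole wrapped word.
bag = char_ngrams("partial") + ["<partial>"]

# The boundary markers keep the n-gram "her" inside "where"
# distinct from the whole word "her", which is stored as "<her>".
print("her" in char_ngrams("where"))    # True
print("<her>" in char_ngrams("where"))  # False
```

Because rare or unseen words still share n-grams with known words, this representation lets the model produce a vector for them.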