Different techniques to represent words as vectors (Word Embeddings)

From Count Vectorizer to Word2Vec

Karan Bhanot, Jun 7

Currently, I’m working on a Twitter Sentiment Analysis project.

While reading about how I could feed text to my neural network, I realized that I had to convert the text of each tweet into a vector of a specified length.

This would allow the neural network to train on the tweets and correctly learn sentiment classification.

Thus, I set out to do a thorough analysis of the various approaches I could take to convert text into vectors, popularly referred to as Word Embeddings.

Word embedding is the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

— Wikipedia

In this article, I’ll explore the following word embedding techniques:

1. Count Vectorizer
2. TF-IDF Vectorizer
3. Hashing Vectorizer
4. Word2Vec

Sample text data

I’m creating 4 sentences on which we’ll apply each of these techniques and understand how they work.

For each of the techniques, I’ll use lowercase words only.
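The original sentences were shared in a code gist that isn't reproduced here. As a stand-in, below is a minimal sketch of what the sample data might look like. Only the first sentence ("he is playing in the field") appears verbatim in the text; the other three are assumptions, chosen so that they match the 17-word vocabulary and the example vectors used later.

```python
# Hypothetical stand-in for the article's 4 sample sentences.
# Only the first sentence appears verbatim in the article; the other
# three are assumptions consistent with the 17-word vocabulary and the
# example vectors shown below.
sentences = [
    "he is playing in the field",
    "he is running towards the football",
    "the football game ended",
    "it started raining while everyone was playing in the field",
]
```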

Count Vectorizer

The most basic way to convert text into vectors is through a Count Vectorizer.

Step 1: Identify unique words in the complete text data.

In our case, the list is as follows (17 words):

['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it', 'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']

Step 2: For each sentence, we’ll create an array of zeros with the same length as above (17).

Step 3: Taking each sentence one at a time, we’ll read the first word and find its total occurrence in the sentence.

Once we have the number of times it appears in that sentence, we’ll identify the position of the word in the list above and replace the zero at that position with this count.

This is repeated for all words and for all sentences.

Example

Let’s take the first sentence, He is playing in the field.

Its vector is [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

The first word is He.

Its total count in the sentence is 1.

Also, in the list of words above, its position is 6th from the start (all words are lowercase).

I’ll just update its vector, which will now be:

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Considering the second word, which is is, the vector becomes:

[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Similarly, I’ll update the rest of the words as well, and the vector representation for the first sentence becomes:

[0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

The same will be repeated for all other sentences as well.
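As a rough sketch of the manual process above (not the author's original code), the counting can be written in a few lines of Python. The sentences are the assumed sample data from the earlier sketch.

```python
# A minimal sketch of the manual count-vector construction described above,
# using the assumed sample sentences (only the first is verbatim from the article).
sentences = [
    "he is playing in the field",
    "he is running towards the football",
    "the football game ended",
    "it started raining while everyone was playing in the field",
]

# Step 1: identify the unique words (sorted, giving the 17-word list above).
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})

# Steps 2 and 3: for each sentence, start from zeros and fill in the count
# of each vocabulary word at its position.
vectors = []
for sentence in sentences:
    words = sentence.split()
    vectors.append([words.count(term) for term in vocabulary])

print(vocabulary)
print(vectors[0])  # expected: [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```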

Code

sklearn provides the CountVectorizer class to create these word embeddings.

After importing the package, we just need to apply fit_transform() on the complete list of sentences to get the array of vectors for each sentence.
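The original gist isn't reproduced here; the following is a minimal sketch of that usage, again with the assumed sample sentences rather than the author's exact data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed sample sentences (only the first appears verbatim in the article).
sentences = [
    "he is playing in the field",
    "he is running towards the football",
    "the football game ended",
    "it started raining while everyone was playing in the field",
]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse count matrix;
# toarray() converts it to a dense array for printing.
vectors = vectorizer.fit_transform(sentences).toarray()

# In older sklearn versions this method is called get_feature_names().
print(vectorizer.get_feature_names_out())  # the 17 unique words
print(vectors)
```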

The output shows the vector representation of each sentence.

TF-IDF Vectorizer

While Count Vectorizer converts each sentence into its own vector, it does not consider the importance of a word across the complete list of sentences.

For example, the word he appears in two sentences, and it provides no useful information in differentiating between them.

Thus, it should have a lower weight in the overall vector of the sentence.

This is where the TF-IDF Vectorizer comes into the picture.

TF-IDF is a product of two parts:

TF (Term Frequency): the number of times a word appears in the given sentence.

IDF (Inverse Document Frequency): the natural logarithm of the total number of documents divided by the number of documents in which the word appears.

Step 1: Identify unique words in the complete text data.

In our case, the list is as follows (17 words):

['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it', 'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']

Step 2: For each sentence, we’ll create an array of zeros with the same length as above (17).

Step 3: For each word in each sentence, we’ll calculate the TF-IDF value and update the corresponding value in the vector of that sentence.

Example

We’ll first define an array of zeros for all the 17 unique words in all sentences combined.

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I’ll take the word he in the first sentence, He is playing in the field, and apply TF-IDF to it.

The value will then be updated in the array for the sentence and repeated for all words.

Total documents (N): 4
Documents in which the word appears (n): 2
Number of times the word appears in the first sentence: 1
Number of words in the first sentence: 6

Term Frequency (TF) = 1
Inverse Document Frequency (IDF) = log(N/n) = log(4/2) = log(2)
TF-IDF value = 1 * log(2) = 0.69314718

Updated vector:

[0, 0, 0, 0, 0, 0.69314718, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The same will get repeated for all other words.
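To sanity-check the arithmetic, the same value can be reproduced with Python's math module (the variable names here are just for illustration):

```python
import math

N = 4   # total documents
n = 2   # documents containing the word "he"
tf = 1  # occurrences of "he" in the first sentence

idf = math.log(N / n)  # natural log, as defined above
print(tf * idf)        # 0.6931471805599453
```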

However, some libraries may use different methods to calculate this value.

For example, sklearn calculates the Inverse Document Frequency as:

IDF = log(N/n) + 1

Thus, the TF-IDF value would be:

TF-IDF value = 1 * (log(4/2) + 1) = 1 * (log(2) + 1) = 1.69314718

The process, when repeated, would represent the vector for the first sentence as:

[0, 0, 1.69314718, 0, 0, 1.69314718, 1.69314718, 1.69314718, 0, 1.69314718, 0, 0, 0, 1, 0, 0, 0]

Code

sklearn provides the TfidfVectorizer class to calculate the TF-IDF values.

However, it applies l2 normalization by default, which I’ll disable by setting the norm flag to None, and I’ll keep the smooth_idf flag as False so that the IDF calculation above is the one used.
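A minimal sketch of that setup, once more with the assumed sample sentences standing in for the original gist:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed sample sentences (only the first appears verbatim in the article).
sentences = [
    "he is playing in the field",
    "he is running towards the football",
    "the football game ended",
    "it started raining while everyone was playing in the field",
]

# norm=None disables the default l2 normalization, and smooth_idf=False
# makes sklearn use IDF = log(N/n) + 1, as discussed above.
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
vectors = vectorizer.fit_transform(sentences).toarray()

print(vectors[0])  # the entry for "he" should be about 1.69314718
```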

The output shows the vector representation of each sentence.

Hashing Vectorizer

This vectorizer is very useful as it allows us to convert any word into its hash and does not require the generation of any vocabulary.

Step 1: Define the size of the vector to be created for each sentence.

Step 2: Apply the hashing algorithm (like MurmurHash) to the sentence.

Step 3: Repeat step 2 for all sentences.

Code

As the process is simply the application of a hash function, we can just take a look at the code.

I’ll use the HashingVectorizer class from sklearn.

The normalization will be removed by setting the norm parameter to None.

Given that both vectorization techniques discussed above produced 17 columns in each vector, I’ll set the number of features to 17 here as well.

This will generate the necessary hashing value vector.
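A minimal sketch of that usage, assuming the same sample sentences as before:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Assumed sample sentences (only the first appears verbatim in the article).
sentences = [
    "he is playing in the field",
    "he is running towards the football",
    "the football game ended",
    "it started raining while everyone was playing in the field",
]

# n_features=17 matches the vector length used above; norm=None removes the
# default normalization. Note that hashed positions do not correspond to the
# sorted vocabulary, collisions are possible with so few features, and the
# default alternate_sign setting may flip the sign of some counts.
vectorizer = HashingVectorizer(n_features=17, norm=None)
vectors = vectorizer.fit_transform(sentences).toarray()

print(vectors[0])
```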

Word2Vec

These are a set of neural network models that aim to represent words in the vector space.

These models are highly efficient and performant in understanding the context and relation between words.

Similar words are placed close together in the vector space while dissimilar words are placed wide apart.

These models are so good at representing words that they can even identify key relationships such as:

King - Man + Woman = Queen

The model is able to decipher that what a man is to a king, a woman is to a queen.

The respective relationships could be identified through these models.

There are two models in this class:

CBOW (Continuous Bag of Words): The neural network looks at the surrounding words (say, 2 to the left and 2 to the right) and predicts the word that comes in between.

Skip-gram: The neural network takes in a word and then tries to predict the surrounding words.

The neural network has one input layer, one hidden layer, and one output layer to train on the data and build the vectors.

As this is the basic way a neural network works, I’ll skip the step-by-step process.

Code

To implement the word2vec model, I’ll use the gensim library, which provides many features on top of the model, such as finding the odd one out, the most similar words, etc.

However, it does not lowercase or tokenize the sentences, so I do that myself.

The tokenized sentences are then passed to the model.

I’ve set the size of the vector to 2, the window to 3 (which defines the distance up to which to look), and sg = 0, which uses the CBOW model.

I used the most_similar method to find all words similar to the word football and then printed out the most similar one.

For different trainings, we’ll get different results, but in the last run I tried, I got the most similar word to be game.
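A minimal sketch of that workflow with gensim, using the assumed sample sentences. Parameter names follow gensim 4.x, where the vector size argument is vector_size (it was called size in older versions); min_count=1 is needed so that no word is dropped from such a tiny corpus.

```python
from gensim.models import Word2Vec

# Assumed sample sentences (only the first appears verbatim in the article).
sentences = [
    "He is playing in the field",
    "He is running towards the football",
    "The football game ended",
    "It started raining while everyone was playing in the field",
]

# gensim does not lowercase or tokenize for us, so do it here.
tokenized = [sentence.lower().split() for sentence in sentences]

# vector_size=2, window=3, and sg=0 selects the CBOW model.
model = Word2Vec(tokenized, vector_size=2, window=3, sg=0, min_count=1)

# Find the words most similar to "football" and print the top one.
similar = model.wv.most_similar("football")
print(similar[0])
```

With only 4 sentences the results will vary from run to run, as noted above.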

The dataset here is just of 4 sentences.

If we increased the dataset size, the neural network would be able to find relationships better.

Conclusion

There we have it.

We’ve looked at 4 ways to create word embeddings and how we can use code to implement them.

If you have any thoughts, ideas and suggestions, do share and let me know.

Thanks for reading!
