Word2Vec For Phrases — Learning Embeddings For More Than One Word

During the training phase, we iterate over the tokens in the corpus (the target words) and look at a window of size k around each one (k words on each side of the target word, with k typically between 2 and 10). At the end of training, the network gives us an embedding matrix:

[Figure: embedding matrix after Word2Vec training]

Now, each word is represented not by a discrete, sparse vector but by a d-dimensional continuous vector, and the meaning of each word is captured by its relation to other words [5]. The reason is that, at training time, if two target words share the same contexts, the network weights for those two words will intuitively end up close to each other, and so will their vectors. We therefore get a distributional representation for each word in the corpus, in contrast to count-based approaches (like BOW and TF-IDF). Because of this distributional behavior, a specific dimension of the vector gives no valuable information on its own, but looking at the (distributional) vector as a whole, one can perform many similarity tasks. For example, we get that V(“King”) - V(“Man”) + V(“Woman”) ~= V(“Queen”) and V(“Paris”) - V(“France”) + V(“Spain”) ~= V(“Madrid”). In addition, we can apply similarity measures, like cosine similarity, between vectors and find that the vector of the word “president” is close to “Obama”, “Trump”, “CEO”, “chairman”, and so on.

As seen above, we can perform many similarity tasks on words using Word2Vec. But, as mentioned, we want to do the same for more than one word.

Learning Phrases From Unsupervised Text (Collocation Extraction)

We could easily create bi-grams from our unsupervised corpus and feed them to Word2Vec. For example, the sentence “I walked today to the park” would be converted to “I_walked walked_today today_to to_the the_park”, and each bi-gram would be treated as a uni-gram in the Word2Vec training phase. This would work, but the approach has some problems:

1. It learns embeddings only for bi-grams, while many of these bi-grams are not really meaningful (for example, “walked_today”), and we miss embeddings for uni-grams like “walked” and “today”.
2. Working only with bi-grams creates a very sparse corpus. Think again about the sentence “I walked today to the park”. If the target word is “walked_today”, this term is not very common in the corpus, and we will not have many context examples from which to learn a representative vector for it.

So, how do we overcome this problem? How do we extract only the meaningful terms, merging words into a single token when their mutual information is strong enough, while keeping the rest as uni-grams?
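Before answering that, here is a minimal sketch of the basic Word2Vec training and similarity queries described above. It assumes gensim 4.x (where the dimensionality parameter is called vector_size); the toy corpus, parameter values, and variable names are illustrative only, and meaningful analogies require a much larger corpus.

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large unsupervised corpus.
sentences = [
    ["i", "walked", "today", "to", "the", "park"],
    ["the", "king", "spoke", "to", "the", "queen"],
    ["the", "man", "walked", "with", "the", "woman"],
    ["the", "president", "met", "the", "chairman"],
]

# Skip-gram (sg=1), window of k=2 words on each side, 50-dimensional vectors.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

# Cosine similarity between two word vectors.
print(model.wv.similarity("king", "queen"))

# Analogy-style query: V("king") - V("man") + V("woman") ~= ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```

On a corpus this small the resulting numbers are meaningless; the point is only the shape of the training call and the similarity/analogy queries.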
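The naive bi-gram conversion described above takes only a few lines of plain Python to sketch; the helper name here is hypothetical.

```python
def to_bigrams(tokens):
    """Join every pair of adjacent tokens with '_' to form bi-grams."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

print(to_bigrams("I walked today to the park".split()))
# ['I_walked', 'walked_today', 'today_to', 'to_the', 'the_park']
```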
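As for quantifying “mutual information strong enough”, one common criterion is a pointwise mutual information (PMI) score computed from corpus counts. The sketch below only illustrates that idea; the helper name, toy corpus, and min_count threshold are made up, and this is not necessarily the exact scoring function used later in the article.

```python
from collections import Counter
from math import log

def pmi_scores(sentences, min_count=2):
    """Score candidate bi-grams by PMI(a, b) = log(P(a, b) / (P(a) * P(b))).
    Higher scores suggest the pair is a meaningful collocation."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = log(p_ab / (p_a * p_b))
    return scores

corpus = [
    ["new", "york", "is", "a", "big", "city"],
    ["i", "walked", "today", "in", "new", "york"],
]
print(sorted(pmi_scores(corpus).items(), key=lambda kv: -kv[1]))
```

Under this kind of criterion, a frequent pair like “new york” scores high and gets merged into one token, while an incidental pair like “walked_today” does not.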
