Art of Vector Representation of Words

Ashish Rana, Dec 5

The expressive power of the notations used to represent the vocabulary of a language has long been of great interest in the field of linguistics. These systems represent every word of a vocabulary as a vector, creating a finite vector space. Let's look at an example of the one-hot representation of words.

Hence, our search for more powerful representations must go on.

Also, notice that the dot product between the rows of the matrix Wword = UΣ is the same as the dot product between the rows of X̂ (the low-rank reconstruction of the co-occurrence matrix X). Wword = UΣ ∈ R^(m×k) is taken as the representation of the m words in the vocabulary, and Wcontext = V is taken as the representation of the context words.

Continuous Bag of Words

The methods we have seen so far are count-based models: SVD, for example, works on co-occurrence counts and follows classical statistical NLP principles. Now we move on to prediction-based models, which learn word representations directly.

A one-to-one mapping exists between words and the columns of Wcontext, so we can treat the i-th column of Wcontext as the representation of context i. This clearly shows that the weight parameters of the neural network architecture serve as the word vector representations. Having understood this simple interpretation of the parameters, our aim now is to learn them. The training objective ensures that the cosine similarity between a word vector (vw) and its context vector (uc) is maximized. The neural network also helps in learning much simpler and more abstract vector representations of words.

In practice, more than one word is used as context; it is common to use a window of size d, chosen according to the use case. We must explore other models that mitigate this bottleneck step.

Skip-Gram Model

This model predicts context words for a given input word. The roles of context and word have changed to an almost opposite sense: given an input word as a one-hot representation, our aim is to predict the context words related to it. This inverse relationship between the CBOW and Skip-Gram models will become clearer below.

Given a corpus, the model loops over the words of each sentence and either uses the current word to predict its neighbours (its context), in which case the model is called "Skip-Gram", or uses each of these contexts to predict the current word, in which case the model is called "Continuous Bag of Words" (CBOW).

With 'on' as the input word, the probabilities of the context words related to it are predicted by this network.

We train a simple neural network with a single hidden layer to perform a certain task, but we are not actually going to use that network for the task it was trained on. Instead, the goal is simply to learn the weights of the hidden layer, which are the word vectors, as shown by the mathematics in the section above. In the simple case where there is only one context word, we arrive at the same update rule for uc as we did for vw earlier.
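To make the one-hot representation discussed at the start concrete, here is a minimal sketch; the toy vocabulary and the one_hot helper are illustrative assumptions, not part of the original post. It also shows why the search for more powerful representations must go on: every pair of distinct one-hot vectors is orthogonal, so no similarity between words is captured.

```python
# Minimal sketch of the one-hot representation of words.
# The toy vocabulary and the one_hot helper are illustrative assumptions.
import numpy as np

vocab = ["he", "is", "a", "king", "she", "queen"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("king"))                      # [0. 0. 0. 1. 0. 0.]
print(one_hot("king") @ one_hot("queen"))   # 0.0 -- distinct words are always orthogonal
```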
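The count-based approach above (Wword = UΣ, Wcontext = V) can be sketched roughly as follows; the toy corpus, the window size of 1, and the rank k = 2 are assumptions made only for illustration.

```python
# Minimal sketch of the count-based (SVD) approach: co-occurrence counts + truncated SVD.
# The toy corpus, window size of 1, and rank k are illustrative choices.
import numpy as np

corpus = ["he is a king", "she is a queen", "he is a man", "she is a woman"]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}
m = len(vocab)

# Build the word-context co-occurrence matrix X with a window of 1.
X = np.zeros((m, m))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                X[index[w], index[sent[j]]] += 1

# Truncated SVD: X ≈ UΣV^T, keeping the top-k singular directions.
U, S, Vt = np.linalg.svd(X)
k = 2
W_word = U[:, :k] * S[:k]        # W_word = UΣ ∈ R^(m×k), the word representations
W_context = Vt[:k, :].T          # W_context = V, the context representations

# Dot products between rows of W_word equal those between rows of X̂ = UΣV^T (rank k).
print({w: np.round(W_word[index[w]], 3) for w in ["king", "queen", "man", "woman"]})
```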
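Finally, here is a rough sketch of the prediction-based Skip-Gram idea: a one-hot input selects one row of the hidden-layer weight matrix, the output layer scores every vocabulary word, and after training the hidden-layer weights are taken as the word vectors. The corpus, dimensions, learning rate, and epoch count are illustrative assumptions; a real implementation would also mitigate the softmax-over-vocabulary bottleneck mentioned above (word2vec does this with negative sampling or a hierarchical softmax).

```python
# Minimal sketch of the Skip-Gram idea: one-hot input -> hidden layer (W_word)
# -> softmax over the vocabulary for the context word.
# Corpus, dimensions, learning rate, and epoch count are illustrative assumptions.
import numpy as np

corpus = ["he is a king", "she is a queen", "he is a man", "she is a woman"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}
m, k, lr = len(vocab), 10, 0.05

# (word, context) training pairs with a window size d = 1.
pairs = [(index[sent[i]], index[sent[j]])
         for sent in tokens
         for i in range(len(sent))
         for j in range(max(0, i - 1), min(len(sent), i + 2)) if j != i]

rng = np.random.default_rng(0)
W_word = rng.normal(scale=0.1, size=(m, k))      # hidden-layer weights: word vectors v_w
W_context = rng.normal(scale=0.1, size=(k, m))   # output-layer weights: context vectors u_c

for epoch in range(200):
    for w, c in pairs:
        v_w = W_word[w]                          # picking row w == multiplying by a one-hot input
        scores = v_w @ W_context                 # one score per vocabulary word
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                     # softmax: the expensive step over the vocabulary
        grad = probs.copy()
        grad[c] -= 1.0                           # d(cross-entropy)/d(scores)
        W_word[w] -= lr * (W_context @ grad)     # update v_w
        W_context -= lr * np.outer(v_w, grad)    # update all u_c

print(np.round(W_word[index["king"]], 3))        # the learned word vector for 'king'
```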
