Word Vectors and Lexical Semantics (Part 1)
Hafidz Zulkifli, Mar 27
The following are my personal notes based on the Deep NLP course by Oxford University held in 2017. The material is available at [1].
Introduction
Word Vectors: representations of words in vector format.
Lexical Semantics: the analysis of word meanings and the relationships between them.
Neural networks require vector representations as inputs, so words and sentences must first be converted into vectors.
Representing Words
Text is merely a sequence of discrete symbols (i.e. words). A simple way to represent them is to one-hot encode every word in the sentence.
However, doing this requires a lot of memory/space, as each vector's dimensionality equals the size of your vocabulary. A deeper problem is that each vector is defined by a single word, so all vectors are mutually orthogonal, with no relation to each other in terms of semantics. The vectors are also extremely sparse.
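A minimal sketch of one-hot encoding over a toy vocabulary (the words here are made up for illustration) shows both weaknesses at once: the vector length grows with the vocabulary, and any two distinct word vectors have a dot product of zero, i.e. they carry no similarity information.

```python
# Toy vocabulary (hypothetical); real vocabularies have tens of thousands of words.
vocab = ["cat", "dog", "runs", "fast"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector for `word`: all zeros except a 1 at its index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0, 0]

# Dot product between any two different one-hot vectors is 0: "cat" and "dog"
# look exactly as unrelated as "cat" and "fast".
dot = sum(a * b for a, b in zip(one_hot("cat"), one_hot("dog")))
print(dot)  # 0
```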
Thus there is a need for a richer representation that can express semantic similarity.
Distributional Semantics
Distributional Semantics: a research area that develops and studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data [2].
"You shall know a word by the company it keeps" — J. R. Firth (1957)
The above quote, and other analogies like it, points out that the meaning of a word can be understood by looking at how it is used by the population of speakers.
At the same time, we are also interested in reducing the size of our vector space.
This can be done by producing dense vector representations (as opposed to sparse ones).
Computationally, there are three main approaches to doing this:
- Count-based
- Predictive
- Task-based
The advantage of representing words as vectors is that one can start to objectively measure and compare word vectors, e.g. to calculate similarity, distance and so on.
Let’s cover the count-based method first.
Count-based Method
Define the basis vocabulary to be used. Usually it is chosen based on one's own experience/intuition or on statistics of the corpus; ideally, the vocabulary is informative and meaningful. Having said that, more often than not people simply take all the words available in the corpus to define the vocabulary. The size of the vocabulary is usually limited. Stop words are normally excluded, since they appear frequently in most corpora.
If we were to include them, we would have trouble identifying relationships, since stop words co-occur with everything.

Example (credits to [1]).

Having identified the target words and their contexts, we can now represent each word as a vector.
Example (credits to [1]).

Note that the values don't have to only be 1.
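The counting step above can be sketched in a few lines. This is a toy illustration with a made-up two-sentence corpus and an assumed context window of two words on each side; the course itself may define contexts differently.

```python
from collections import Counter

# Hypothetical mini-corpus; a window of +/-2 words defines the "context".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
window = 2

counts = {}  # target word -> Counter of co-occurring context words
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        # Context = up to `window` words before and after the target.
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        counts.setdefault(target, Counter()).update(ctx)

# "sat" co-occurs with "on" once per sentence, so its count is 2, not 1.
print(counts["sat"]["on"])  # 2
```

Laying each word's Counter out over a fixed context vocabulary gives exactly the count vectors described above.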
As vectors, the words can now be analyzed, for instance via similarity (the most popular measure being cosine similarity) or distance in vector space.
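For reference, cosine similarity is the dot product of two vectors divided by the product of their norms; a small self-contained implementation (the example vectors are arbitrary) looks like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0
print(cosine_similarity([1, 0], [0, 1]))        # 0.0
```

Because it depends only on direction, not length, cosine similarity is insensitive to how often a word appears overall, which makes it a natural fit for raw count vectors.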
However, there are still some disadvantages. Not all words are equally informative: some words simply appear more frequently across the various texts and, by virtue of that, can no longer be uniquely associated with a particular context. For example, in texts describing various four-legged animals, words like "run" or "four legs" wouldn't be able to differentiate the types of animals within the corpus. There are, however, methods to overcome this, such as TF-IDF or PMI.
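As an illustration of the PMI idea, PMI(w, c) = log(p(w, c) / (p(w) p(c))) scores a (word, context) pair higher when they co-occur more than chance would predict. The counts below are invented purely for the example:

```python
import math
from collections import Counter

# Hypothetical (word, context) co-occurrence counts.
pair_counts = Counter({
    ("cat", "purrs"): 4, ("cat", "runs"): 1,
    ("dog", "runs"): 4, ("dog", "purrs"): 1,
})
total = sum(pair_counts.values())
word_counts, ctx_counts = Counter(), Counter()
for (w, c), n in pair_counts.items():
    word_counts[w] += n
    ctx_counts[c] += n

def pmi(w, c):
    """Pointwise mutual information: log p(w,c) / (p(w) * p(c))."""
    p_wc = pair_counts[(w, c)] / total
    p_w = word_counts[w] / total
    p_c = ctx_counts[c] / total
    return math.log(p_wc / (p_w * p_c))

# "purrs" is distinctive for "cat" (PMI > 0); "runs" is shared with "dog",
# so it tells us little about "cat" (PMI < 0).
print(pmi("cat", "purrs") > pmi("cat", "runs"))  # True
```

Replacing raw counts with PMI values in the count vectors down-weights exactly the uninformative, everywhere-frequent contexts described above.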
In my next post, we’ll explore an easier way to resolve these issues.
References
[1] https://github.com/oxford-cs-deepnlp-2017/lectures
[2] https://en.wikipedia.org/wiki/Distributional_semantics