Word Representation in Natural Language Processing Part I

Each unique word in the vocabulary is assigned an ID. As a result, a simple lookup dictionary is constructed, from which word IDs can be retrieved.

Example of a sample lookup dictionary.

Then, for each given word, the corresponding integer representation is returned by looking it up in the dictionary. If the word is not present in the dictionary, the integer corresponding to the Out-of-Vocabulary (OOV) token is returned. In practice, the value of the OOV token is usually set to the size of the dictionary plus one, i.e. length(dictionary) + 1.

While this is a relatively simple approach, it has drawbacks that need to be considered. By treating tokens as integers, the model might incorrectly assume a natural ordering among them. For example, suppose the dictionary contains entries such as 1: "airport" and 2: "plane". A deep learning model might treat the token with the greater ID as more important than tokens with smaller IDs, which is a wrong assumption. Models trained on this type of data are prone to failure. In contrast, data with ordinal values, such as size measures 1: "small", 2: "medium", 3: "large", is suitable for integer encoding, because there is a natural ordering in the data.

One-Hot Encoding

The second approach to word representation is one-hot encoding. The main idea is to create a vector of vocabulary size filled with zeros except for one position: for a single word, only the corresponding column is set to 1 and the rest remain zero.
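The integer-encoding lookup described in the first approach can be sketched in a few lines. The vocabulary contents here are illustrative, and the OOV convention follows the length(dictionary) + 1 rule from the text:

```python
def build_vocab(words):
    # Assign each unique word an integer ID, starting at 1.
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    return vocab

def encode(word, vocab):
    # Unknown words map to the OOV ID: len(vocab) + 1.
    return vocab.get(word, len(vocab) + 1)

vocab = build_vocab(["airport", "plane", "pilot"])
print(encode("plane", vocab))   # 2
print(encode("runway", vocab))  # OOV, so len(vocab) + 1 = 4
```

Note that `encode("runway", vocab)` and `encode("cockpit", vocab)` both return the same OOV ID, which is exactly the information loss the OOV token trades for a fixed vocabulary size.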

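The one-hot scheme just described can be sketched as follows, assuming a word-to-index dictionary of the kind built earlier (the vocabulary here is illustrative):

```python
def one_hot(word, vocab):
    # Vector of vocabulary size, all zeros except the word's column.
    vec = [0] * len(vocab)
    if word in vocab:
        vec[vocab[word]] = 1
    return vec

vocab = {"airport": 0, "plane": 1, "pilot": 2}
print(one_hot("plane", vocab))    # [0, 1, 0]
print(one_hot("runway", vocab))   # unknown word: all zeros
```

Unlike integer IDs, these vectors are all equidistant from one another, so no spurious ordering is implied; the trade-off is that each vector is as long as the vocabulary itself.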