An introduction to Bag of Words and how to code it in Python for NLP

Bag of Words (BOW) creates a vocabulary of all the unique words occurring in all the documents in the training set. In simple terms, it represents a sentence as a collection of words with their counts, mostly disregarding the order in which the words appear.

BOW is widely used in natural language processing, information retrieval from documents, and document classification.

At a high level, each document is converted into a numeric vector, and the generated vectors can then be used as input to your machine learning algorithm.

Let's work through an example by taking some sentences and generating vectors for them. Consider the two sentences below.

1. "John likes to watch movies. Mary likes movies too."
2. "John also likes to watch football games."

These two sentences can also be represented as collections of words, with punctuation removed.

1. ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'likes', 'movies', 'too']
2. ['John', 'also', 'likes', 'to', 'watch', 'football', 'games']

Next, for each sentence, collapse repeated words and record how many times each word occurs.

1. {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
2. {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}

Assuming these sentences are part of one document, below is the combined word frequency for the entire document. Both sentences are taken into account.

{"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1}

This vocabulary, built from all the words in the document, is used to create a vector for each sentence. The length of the vector is always equal to the vocabulary size. In this case the vector length is 10.

To represent our original sentences as vectors, each vector is initialized with all zeros: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We then iterate over the vocabulary, compare each vocabulary word against the sentence, and increment the value at that word's position each time the sentence contains it.

John likes to watch movies. Mary likes movies too.
[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

John also likes to watch football games.
[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

For example, the word "likes" is in the second position of the vocabulary and appears twice in sentence 1, so the second element of the first vector is 2.
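The walkthrough above can be sketched in a few lines of plain Python. This is a minimal sketch rather than a definitive implementation: the helper names tokenize, build_vocabulary, and vectorize are hypothetical, and the tokenizer assumes that stripping punctuation with a simple regular expression while keeping letter case is good enough for these two example sentences.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation.

    Assumption: words consist only of ASCII letters and apostrophes,
    and case is preserved ("John" and "john" would be different words).
    """
    return re.findall(r"[A-Za-z']+", sentence)


def build_vocabulary(sentences):
    """Return the unique words in order of first appearance."""
    vocabulary = []
    seen = set()
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def vectorize(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(tokenize(sentence))
    # Counter returns 0 for words the sentence does not contain.
    return [counts[word] for word in vocabulary]


sentences = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

vocabulary = build_vocabulary(sentences)
vectors = [vectorize(sentence, vocabulary) for sentence in sentences]

print(vocabulary)
# ['John', 'likes', 'to', 'watch', 'movies', 'Mary', 'too', 'also', 'football', 'games']
for vector in vectors:
    print(vector)
# [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
# [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

Keeping the vocabulary in order of first appearance is a choice made here so that the output lines up with the vectors in the example. Any fixed ordering works, as long as the same ordering is used for every sentence.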
