NLP: Text Mining Algorithms

Explaining N-Grams, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms and their implementation in Python

By Farhad Malik, Jun 28

This article aims to clearly explain the most widely used text mining algorithms in NLP projects.

It will explain three algorithms:

1. N-Grams
2. Bag of Words (BoW)
3. Term Frequency-Inverse Document Frequency (TF-IDF)


1. N-Grams

N-Grams is an important concept to understand in text analytics.

Essentially, an N-Gram is a contiguous sequence of N items (such as words) that occur next to each other in a text.

Here, N is a numerical value that specifies the number of consecutive items in the sequence of text.

When we type text into a search engine, we can see that the search engine's probabilistic model starts predicting the next set of words based on the context.

This is known as the autocomplete feature of search engines.

N-Grams allow us to build this kind of text prediction model.

For instance, if the sentence is “FinTechExplained is a publication”, then:

1-Gram would be: FinTechExplained, is, a, publication
2-Gram would be: FinTechExplained is, is a, a publication
3-Gram would be: FinTechExplained is a, is a publication

In Python, we can implement N-Grams using the NLTK library:

import nltk
from nltk.util import ngrams
from collections import Counter

# nltk.word_tokenize requires the 'punkt' tokeniser: nltk.download('punkt')
text = 'FinTechExplained is a publication'
tokens = nltk.word_tokenize(text)

# Generate 1-, 2- and 3-grams from the token list
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Count how often each bigram occurs
print(Counter(bigrams))
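To see how this supports autocomplete, here is a minimal sketch that predicts the most likely next word from bigram counts. The tiny corpus and the predict_next helper are hypothetical, purely for illustration:

import nltk
from nltk.util import ngrams
from collections import Counter

# Hypothetical toy corpus; a real model would be trained on far more text
corpus = 'FinTechExplained is a publication. FinTechExplained is a blog.'
tokens = nltk.word_tokenize(corpus.lower())
bigram_counts = Counter(ngrams(tokens, 2))

def predict_next(word):
    # Return the word most frequently observed after `word`
    candidates = {b: c for (a, b), c in bigram_counts.items() if a == word}
    return max(candidates, key=candidates.get) if candidates else None

print(predict_next('is'))  # -> 'a'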

2. Bag of Words (BoW)

In this section, I will explain a concept that is gaining popularity in NLP projects.

It’s known as Bag of Words (BoW).

Essentially, the algorithm builds on the fact that text needs to be converted into numbers before it can be fed into mathematical algorithms.

When we convert the text to numbers, we can apply various techniques.

One of the techniques is to count the occurrences of words in a document.

BoW is all about creating a matrix of words where the rows represent the words (terms) and the columns represent the documents.

We can then populate the matrix with the frequency of each term within the document, ignoring the grammar and order of the terms.

The matrix is referred to as the Term Document Matrix (TDM).

Each row is a word vector.

Each column is a document vector.
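For example, for two hypothetical one-sentence documents d1 = “NLP is fun” and d2 = “NLP is hard”, the TDM would be:

term   d1   d2
NLP     1    1
is      1    1
fun     1    0
hard    0    1

Reading across a row gives the word vector for a term; reading down a column gives the document vector for d1 or d2.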

For instance, assume you extract tweets from Twitter and statuses from Facebook that contain the word “NLP”.

You can then tokenise the sentences into words and populate the TDM, where the columns will be Facebook and Twitter, and the rows will be the terms (words of the text).

The matrix is then populated with the frequency of each term within a document. We can achieve this using the scikit-learn library in Python:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# get_tweets() and get_fb_statuses() are placeholder helpers, assumed to each
# return all text from one source as a single-element pandas Series
data = {'twitter': get_tweets(), 'facebook': get_fb_statuses()}

vectoriser = CountVectorizer()
# pd.concat combines the two sources (Series.append was removed in pandas 2.0)
vec = vectoriser.fit_transform(pd.concat([data['twitter'], data['facebook']]))

# Transpose so that the rows are terms and the columns are documents
df = pd.DataFrame(vec.toarray().transpose(),
                  index=vectoriser.get_feature_names_out())
df.columns = ['twitter', 'facebook']
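As a self-contained variant, this sketch uses the two hypothetical mini-documents from the TDM example above in place of the fetch helpers:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['NLP is fun', 'NLP is hard']  # hypothetical documents d1 and d2
vectoriser = CountVectorizer()
vec = vectoriser.fit_transform(docs)

tdm = pd.DataFrame(vec.toarray().transpose(),
                   index=vectoriser.get_feature_names_out(),
                   columns=['d1', 'd2'])
print(tdm)
# Note: CountVectorizer lowercases terms and sorts them alphabetically,
# so the rows print as fun, hard, is, nlp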

3. Term Frequency-Inverse Document Frequency (TF-IDF)

In NLP projects, we often need to determine the importance of each word.

TF-IDF is a great statistical measure that helps us understand the relevance of a term (word) to a document within a collection of documents.

For each term in a document, the matrix is computed by performing the following 4 steps:

1. Calculate the frequency of the term in the document: divide the number of times the term appears in the document by the total number of terms in the document. This is known as Term Frequency (TF).
2. Divide the total number of documents by the number of documents that contain the term. The inverse of the document frequency is taken so that the log in the next step gives a positive value. This is known as Inverse Document Frequency (IDF).
3. Compute the log of the value computed in step 2. This now gives us a positive value.
4. Finally, multiply the result of step 1 by the result of step 3. This is known as TF-IDF.

Rows of the matrix represent the terms and the columns of the matrix are the document names.
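In symbols, the four steps amount to the following, where t is a term, d a document, N the total number of documents, and df(t) the number of documents that contain t:

TF-IDF(t, d) = TF(t, d) × log(N / df(t))

with TF(t, d) being the number of times t appears in d divided by the total number of terms in d.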

To understand it better, let’s assume there are 100 documents.

4 documents contain the term “FinTechExplained”.

The term is mentioned once in the first and second document, twice in the third document and thrice in the fourth document.

Also let’s consider that there are 100 words in each document.

1. The term frequency for each document is:
Document 1: 1/100 = 0.01
Document 2: 1/100 = 0.01
Document 3: 2/100 = 0.02
Document 4: 3/100 = 0.03
2. The IDF is: 100/4 = 25
3. Log of IDF: log(25) = 1.398
4. Finally, the TF-IDF of the term “FinTechExplained” in document 1 is 0.014 (= 0.01 × 1.398)
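A few lines of Python confirm the arithmetic (a sketch using the base-10 log, which is what gives 1.398 above):

import math

n_docs = 100          # total documents
docs_with_term = 4    # documents containing 'FinTechExplained'
tf_doc1 = 1 / 100     # term appears once among 100 words

idf = n_docs / docs_with_term     # 25.0
log_idf = math.log10(idf)         # 1.398 (rounded)
tfidf_doc1 = tf_doc1 * log_idf    # ~0.014
print(round(tfidf_doc1, 3))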

We can implement it in Python using the scikit-learn library:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# get_tweets() and get_fb_statuses() are placeholder helpers, assumed to each
# return all text from one source as a single-element pandas Series
data = {'twitter': get_tweets(), 'facebook': get_fb_statuses()}

vectoriser = TfidfVectorizer()
# pd.concat combines the two sources (Series.append was removed in pandas 2.0)
vec = vectoriser.fit_transform(pd.concat([data['twitter'], data['facebook']]))

# Transpose so that the rows are terms and the columns are documents
df = pd.DataFrame(vec.toarray().transpose(),
                  index=vectoriser.get_feature_names_out())
df.columns = ['twitter', 'facebook']

Note that TfidfVectorizer uses a smoothed, natural-log variant of the formula above by default, so its values will differ slightly from the hand calculation.

Summary

This article explained the most widely used text mining algorithms in NLP projects.

It covered the N-Grams, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms and their implementation in Python.

