Building Blocks: Text Pre-Processing

')sample_text = """The first time I ate here I honestly was not that impressed.

I decided to wait a bit and give it another chance.

I have recently eaten there a couple of times and although I am not convinced that the pricing is particularly on point the two mushroom and swiss burgers I had were honestly very good.

The shakes were also tasty.

Although Mad Mikes is still my favorite burger around, you can do a heck of a lot worse than Smashburger if you get a craving"""tokenize_sentence = sent_tokenize(sample_text)print (tokenize_sentence)print ('———————————————————.')print ('Following is the list of words tokenized from the sample review sentence.')tokenize_words = word_tokenize(tokenize_sentence[1])print (tokenize_words)Output:Following is the list of sentences tokenized from the sample review['The first time I ate here I honestly was not that impressed.

', 'I decided to wait a bit and give it another chance.

', 'I have recently eaten there a couple of times and although I am not convinced that the pricing is particularly on point the two mushroom and.swiss burgers I had were honestly very good.

', 'The shakes were also tasty.

', 'Although Mad Mikes is still my favorite burger around,.you can do a heck of a lot worse than Smashburger if you get a craving']———————————————————Following is the list of words tokenized from the sample review sentence['I', 'decided', 'to', 'wait', 'a', 'bit', 'and', 'give', 'it', 'another', 'chance', '.
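As a side note (not part of the original walkthrough), the NLTK examples in this section assume the required corpora and models have already been downloaded; a minimal setup sketch might look like this:

```python
import nltk

# Tokenizers, stop word list, WordNet (for lemmatization) and the POS tagger model
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
```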

Stop Words Removal

Often, a few ubiquitous words that appear to be of little value for the purpose of the analysis, but that increase the dimensionality of the feature set, are excluded from the vocabulary entirely as part of the stop words removal process.

There are usually two considerations that motivate this removal.

Irrelevance: Allows one to analyze only the content-bearing words. Stopwords, also called empty words because they generally do not bear much meaning, introduce noise into the analysis/modeling process.

Dimension: Removing the stopwords also reduces the number of tokens in a document significantly, thereby decreasing the feature dimension.

Challenges:

Converting all characters to lowercase before the stopwords removal process can introduce ambiguity in the text, and can sometimes entirely change its meaning.

For example, the expression “US citizen” will be viewed as “us citizen”, or “IT scientist” as “it scientist”.

Since both *us* and *it* are normally considered stop words, it would result in an inaccurate outcome.

The strategy for handling stopwords can thus be refined by adding a part-of-speech tagging step that recognizes that “US” and “IT” are not pronouns in the above examples.
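Before the standard implementation below, here is a rough sketch of that refinement. This is an assumption on my part rather than code from the original example: the function name filter_with_pos and the sample sentence are made up for illustration, and it relies on NLTK's pos_tag to mark proper nouns.

```python
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

def filter_with_pos(text):
    # Tag tokens before lowercasing, e.g. [('US', 'NNP'), ('citizen', 'NN'), ...]
    tagged = pos_tag(word_tokenize(text))
    kept = []
    for token, tag in tagged:
        if tag in ("NNP", "NNPS"):
            # Proper nouns such as "US" or "IT" keep their original case
            kept.append(token)
        elif token.lower() not in stop_words:
            # Everything else is lowercased and checked against the stop word list
            kept.append(token.lower())
    return kept

print(filter_with_pos("The US citizen met an IT scientist"))
```

Whether “US” and “IT” are actually tagged as proper nouns depends on the tagger and the surrounding context, so this is a heuristic rather than a guarantee.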

Implementation example:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# define the language for stopwords removal
stopwords = set(stopwords.words("english"))
print("""{0} stop words""".format(len(stopwords)))

tokenize_words = word_tokenize(sample_text)
filtered_sample_text = [w for w in tokenize_words if w not in stopwords]

print('\nOriginal Text:')
print('——————')
print(sample_text)

print('\nFiltered Text:')
print('——————')
print(' '.join(str(token) for token in filtered_sample_text))
```

Output:

```
179 stop words

Original Text:
——————
The first time I ate here I honestly was not that impressed.
I decided to wait a bit and give it another chance.
I have recently eaten there a couple of times and although I am not convinced that the pricing is particularly on point the two mushroom and swiss burgers I had were honestly very good.
The shakes were also tasty.
Although Mad Mikes is still my favorite burger around, you can do a heck of a lot worse than Smashburger if you get a craving

Filtered Text:
——————
The first time I ate I honestly impressed . I decided wait bit give another chance . I recently eaten couple times although I convinced pricing particularly point two mushroom swiss burgers I honestly good . The shakes also tasty . Although Mad Mikes still favorite burger around , heck lot worse Smashburger get craving
```

Morphological Normalization

Morphology, in general, is the study of the way words are built up from smaller meaning-bearing units called morphemes.

For example, dogs consists of two morphemes: dog and s.

Two commonly used techniques for text normalization are:

Stemming: The procedure aims to identify the stem of a word and use it in lieu of the word itself.

The most popular algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm.

The entire algorithm is too long and intricate to present here [3], but you can find the details here.

Lemmatization: This process refers to doing things correctly with the use of a vocabulary and morphological analysis of words, typically aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun [4].

Implementation example:

```python
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokenize_words = word_tokenize(sample_text)

stemmed_sample_text = []
for token in tokenize_words:
    stemmed_sample_text.append(ps.stem(token))

lemma_sample_text = []
for token in tokenize_words:
    lemma_sample_text.append(lemmatizer.lemmatize(token))

print('\nOriginal Text:')
print('——————')
print(sample_text)

print('\nFiltered Text: Stemming')
print('——————')
print(' '.join(str(token) for token in stemmed_sample_text))

print('\nFiltered Text: Lemmatization')
print('——————')
print(' '.join(str(token) for token in lemma_sample_text))
```
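To see the saw example from above in code, a part-of-speech hint can be passed to the WordNet lemmatizer. The snippet below is a small illustrative sketch, not part of the original example, and the exact outputs depend on the WordNet data installed.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming has no notion of part of speech
print(ps.stem('saw'))

# The WordNet lemmatizer treats the token as a noun by default ('n'),
# so the noun reading 'saw' is returned unchanged
print(lemmatizer.lemmatize('saw'))

# With the verb part of speech, the lemmatizer can map 'saw' back to 'see'
print(lemmatizer.lemmatize('saw', pos='v'))
```

In a full pipeline the pos argument would typically come from a part-of-speech tagger rather than being hard-coded.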
