Processing Text data in Natural Language Processing

Morphological Normalization

This type of normalization is needed when there are multiple representations of a single word.

For example: play, player, playing, played are all mapped to ‘play’.

Although these words differ in meaning, they are contextually similar.

This step converts all variants of a word into a single normalized form (also known as the stem or lemma).

Normalization is an important step in feature engineering with text because it maps high-dimensional features (n different word forms) to a lower-dimensional space (1 feature), which generally makes the data easier for an ML model to learn from.

Inflection: In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood.

An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change.

For example: playing, plays, played -> play

Sentence example: “the boy’s car has different colors” -> “the boy car has differ color”

Applications: Stemming and lemmatization are widely used in tagging systems, indexing, SEO, web search results, and information retrieval.

For example, searching for fish on Google will also return results for fishes and fishing, because fish is the stem of both words.

Stemming

Stemming is the process of reducing inflected words to their root form, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.

The stem (root) is the part of the word to which inflectional or derivational affixes such as -ed, -ize, -s, de-, and mis- are added.

Disadvantage: stemming a word or sentence may produce stems that are not actual words.

E.g. stemming may reduce daily to dai, which is not a word.

Stems are created by removing the suffixes or prefixes attached to a word.

Note: Removing a suffix from a word is called suffix stripping.

Stemming errors: Because stemming is usually based on heuristics, it is far from perfect.

In fact, it commonly suffers from two issues in particular: overstemming and understemming.

(1) Overstemming occurs when too much of a word is cut off.

This can result in nonsensical stems, where all the meaning of the word is lost or muddled.

Or it can result in words being resolved to the same stems, even though they probably should not be.

Take the four words university, universal, universities, and universe.

A stemming algorithm that resolves these four words to the stem “univers” has overstemmed.

While it might be nice to have universal and universe stemmed together and university and universities stemmed together, all four do not fit.

A better resolution might have universal and universe resolve to “univers” and university and universities resolve to “universi.” But enforcing rules that make that so might cause further issues.

(2) Understemming is the opposite issue.

It occurs when we have several words that actually are forms of one another.

It would be nice for them to all resolve to the same stem, but unfortunately, they do not.

This can be seen with a stemming algorithm that stems the words data and datum to “dat” and “datu.”

You might think the fix is simply to resolve both to “dat,” but as with overstemming, enforcing such a specific rule can create problems elsewhere.

A computer program that stems words is called a stemmer.

NLTK provides stemmers for English and for several other languages.

How stemming works: Stemming algorithms are typically rule-based.

You can view them as a heuristic process that lops off the ends of words.

A word is looked at and run through a series of conditionals that determine how to cut it down.
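To make the "series of conditionals" concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. The rule list, the minimum stem length, and the name toy_stem are illustrative assumptions; this is not the Porter algorithm or any NLTK implementation, which use far more rules.

```python
# Toy rule-based stemmer: rules are tried in order, first match wins.
# These rules are illustrative assumptions, not the Porter rules.
SUFFIX_RULES = [
    ("ies", "y"),   # universities -> university
    ("ing", ""),    # playing -> play
    ("ed", ""),     # played -> play
    ("s", ""),      # plays -> play
]

MIN_STEM_LENGTH = 3  # refuse rules that would leave almost nothing


def toy_stem(word: str) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


for w in ["playing", "plays", "played", "play", "universities", "is"]:
    print(w, "->", toy_stem(w))
```

Even this tiny rule set shows why heuristics are fragile: the length check is the only thing stopping "is" from being cut down to "i".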

Different stemming algorithms: For English, NLTK provides PorterStemmer, LancasterStemmer, and SnowballStemmer.

For other languages, it provides SnowballStemmer (which supports Danish, Dutch, English, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, and Swedish, plus a “porter” option for the original Porter algorithm), ISRIStemmer (for Arabic), and RSLPStemmer (for Portuguese).

Lemmatization

The word ‘lemma’ means the canonical form, dictionary form, or citation form of a set of words.

In lemmatization, the root word is called the lemma.

For lemmatization to resolve a word to its lemma, it needs to know its part of speech.

That requires extra computational linguistics power such as a part of speech tagger.

This allows it to do better resolutions (like resolving is and are to “be”).
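The role of the part of speech can be illustrated with a toy lookup. The table below is hand-written for this example and the names LEMMA_TABLE and toy_lemmatize are hypothetical; a real lemmatizer relies on a full dictionary and a POS tagger rather than a handful of entries.

```python
# Hypothetical lookup table keyed by (word, part of speech).
# A real lemmatizer would consult a full lexicon instead.
LEMMA_TABLE = {
    ("is", "VERB"): "be",
    ("are", "VERB"): "be",
    ("saw", "VERB"): "see",   # "I saw the film"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
    ("playing", "VERB"): "play",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("are", "VERB"))
print(toy_lemmatize("saw", "VERB"))
print(toy_lemmatize("saw", "NOUN"))
```

The same surface form "saw" resolves to two different lemmas, which is why the part of speech has to be known first.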

Difference between stemming and lemmatization: A lemma is the base form of all its inflectional forms, whereas a stem is not. This causes a few issues:

(a) The stem can be the same for the inflectional forms of different lemmas.

This translates into noise in our search results.

In fact, it is very common to find entire forms that are instances of several lemmas.

(b) The same lemma can correspond to forms with different stems, and we need to treat them as the same word.

For example, in Greek, a typical verb has different stems for perfective forms and for imperfective ones.

With a stemming algorithm we would not be able to relate them to the same verb, but with lemmatization it is possible to do so.

Object Standardization

If a document contains words or phrases that are not in standard lexical dictionary form, search engines and models will not recognize them, so they must be removed or replaced with a standard form.

This process is called object standardization.

E.g. acronyms (rt: retweet, dm: direct message), hashtags with attached words, and colloquial slang.

How to do this task?

The right approach depends on the task at hand.

With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed.
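A minimal sketch of this idea, assuming a small hand-made dictionary and one regular expression for hashtags. The entries in LOOKUP beyond rt and dm, and the function names, are assumptions made for illustration.

```python
import re

# Manually prepared data dictionary (entries are illustrative).
LOOKUP = {
    "rt": "retweet",
    "dm": "direct message",
    "luv": "love",
    "awsm": "awesome",
}

HASHTAG = re.compile(r"#(\w+)")


def split_hashtag(match: re.Match) -> str:
    """Turn '#MachineLearning' into 'Machine Learning'."""
    # Insert a space wherever a lowercase letter is followed by an uppercase one.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))


def standardize(text: str) -> str:
    """Expand hashtags, then replace known acronyms and slang word by word."""
    text = HASHTAG.sub(split_hashtag, text)
    words = [LOOKUP.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


print(standardize("RT this is a dm about #MachineLearning"))
```

Because the lookup works on whitespace-separated words, a token with punctuation attached (such as "dm,") would be missed, so tokenization should normally happen first.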

Collocation

Some combinations of words (phrases) in English carry more meaning when they occur together in a text than when they occur individually. Such phrases are termed collocations.

For example, in a hospital setting, ‘CT scan’ makes more sense than ‘CT’ and ‘scan’ separately.

The two most common types of collocation are:

(a) Bigrams: two adjacent words, e.g. ‘CT scan’, ‘machine learning’, ‘social media’

(b) Trigrams: three adjacent words, e.g. ‘out of business’, ‘game of thrones’

Why collocations are important:

(a) Keyword extraction: identifying the most relevant keywords in documents to assess which aspects are most talked about

(b) Bigrams/trigrams can be concatenated (e.g. social media -> social_media) and counted as one word to improve insights analysis and topic modeling, and to create more meaningful features for predictive models in NLP problems

How to find collocations in a document:

For a given sentence, many combinations of bigrams and trigrams can be created.

But not every bigram is useful.

We need a filter that picks only the relevant bigrams and trigrams.

There are different ways to filter out useful and relevant collocations, such as frequency counting, Pointwise Mutual Information (PMI), and hypothesis testing (t-test and chi-square).
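As a rough sketch of two of these filters combined, the function below counts bigrams, drops the rare ones, and scores the rest with PMI, computed here as log2(P(w1, w2) / (P(w1) P(w2))) from simple relative frequencies. The name bigram_pmi, the min_count threshold, and the sample sentence are assumptions for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI, keeping pairs seen at least min_count times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # frequency filter: ignore rare pairs
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest PMI first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


example_tokens = "the ct scan was clear and the second ct scan was clear too".split()
for pair, score in bigram_pmi(example_tokens):
    print("_".join(pair), round(score, 2))
```

A positive score means the pair co-occurs more often than its individual word frequencies would suggest. On a corpus this small the numbers mean little; the frequency threshold matters because PMI tends to overrate very rare pairs.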

Text to features: Feature Engineering on Text Data

The sections above give an overview of pre-processing raw textual data.

Now we will move ahead and learn how to extract features from such processed data for further analysis.

Various methods to construct textual features are as follows: Syntactical Parsing, Entity Extraction, Statistical Features, and Word Embeddings.

(1) Syntactical Parsing (Dependency Parsing)

Dependency parsing is the task of extracting a dependency parse of a sentence, which represents its grammatical structure and defines the relationships between “head” words and the words that modify those heads.

Dependency grammar describes the structure of a sentence as a graph (tree) in which nodes (v) represent words and edges (e) represent dependencies.
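The graph view can be sketched with a small data structure. The parse of "The boy plays football" below is written by hand for illustration (it is not the output of a parser), and the relation labels nsubj, det, and obj are assumed to follow the Universal Dependencies naming style.

```python
from collections import defaultdict

# Hand-written dependency edges: (head, relation, dependent).
EDGES = [
    ("plays", "nsubj", "boy"),
    ("boy", "det", "The"),
    ("plays", "obj", "football"),
]
ROOT = "plays"


def build_tree(edges):
    """Map each head word to the list of (relation, dependent) pairs below it."""
    children = defaultdict(list)
    for head, relation, dependent in edges:
        children[head].append((relation, dependent))
    return children


def render(tree, node, relation="root", depth=0):
    """Return the tree as indented lines, one word per line."""
    lines = ["  " * depth + f"{node} ({relation})"]
    for child_relation, child in tree.get(node, []):
        lines.extend(render(tree, child, child_relation, depth + 1))
    return lines


print("\n".join(render(build_tree(EDGES), ROOT)))
```

Words are used directly as node keys to keep the sketch short; a sentence that repeats a word would need token positions as keys instead.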

Figure: Example showcasing dependency parsing and POS tagging

(2) Part-of-speech Tagging

The part of speech explains how a word is used in a sentence.

There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections.

Noun (N): Daniel, London, table, dog, teacher, pen, city, happiness, hope
Verb (V): go, speak, run, eat, play, live, walk, have, like, are, is
Adjective (ADJ): big, happy, green, young, fun, crazy, three
Adverb (ADV): slowly, quietly, very, always, never, too, well, tomorrow
Preposition (P): at, on, in, from, with, near, between, about, under
Conjunction (CON): and, or, but, because, so, yet, unless, since, if
Pronoun (PRO): I, you, we, they, he, she, it, me, us, them, him, her, this
Interjection (INT): Ouch!, Wow!, Great!, Help!, Oh!, Hey!, Hi!

Most parts of speech are divided into sub-classes.

POS Tagging simply means labeling words with their appropriate Part-Of-Speech.

POS tagging is typically treated as a supervised learning problem.

Taggers use features such as the previous word, the next word, and whether the first letter is capitalized.
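A sketch of what such a feature extractor might look like for one token. The function name, the feature keys, and the START/END padding markers are assumptions for illustration; real taggers typically add many more features, such as suffixes and the previous tags.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),
        # Neighbouring words, with padding markers at the sentence edges
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        # Capitalization often hints at proper nouns
        "is_capitalized": word[:1].isupper(),
    }


sentence = ["Please", "book", "my", "flight", "for", "Delhi"]
for position in range(len(sentence)):
    print(token_features(sentence, position))
```

One such dictionary per token, paired with its gold tag, is the kind of training example a supervised tagger learns from.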

NLTK has a function to get POS tags (nltk.pos_tag), and it is applied after tokenization.

The most popular tag set is the Penn Treebank tagset.

Most pre-trained taggers for English are trained on this tag set.

The complete list is available at [8].

POS tagging is used for many important purposes in NLP:

(1) Word sense disambiguation: Some words have multiple meanings depending on their usage.

For example, in the two sentences below:

I. “Please book my flight for Delhi”

II. “I am going to read this book in the flight”

“Book” is used in different contexts, and its part-of-speech tag differs between the two cases.

In sentence I, the word “book” is used as a verb, while in II it is used as a noun.

(The Lesk algorithm is also used for similar purposes.)

(2) Improving word-based features: When only the words themselves are used as features, a learning model cannot tell the different contexts of a word apart.

For example:

Sentence: “book my flight, I will read this book”

Tokens: (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)

But if the part-of-speech tag is linked with each word, the context is preserved, making for stronger features.

For example:

Tokens with POS: (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)

(3) Normalization and lemmatization: POS tags are the basis of the lemmatization process for converting a word to its base form (lemma).

(4) Efficient stopword removal: POS tags are also useful for efficient removal of stopwords.

For example, some tags consistently mark the less important words of a language.

For example: (IN: “within”, “upon”, “except”), (CD: “one”, “two”, “hundred”), (MD: “may”, “must”, etc.)

Entity Extraction

Topic Modelling and Named Entity Recognition are the two key entity detection methods in NLP.

(a) Topic Modelling

Topic modelling is the process of automatically identifying the topics present in a text object and deriving hidden patterns in the text corpus.

This helps in better decision making.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”.

A good topic model should suggest words like “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.

Note: Latent Dirichlet Allocation (LDA) is a commonly used model for topic modelling.

(b) Named Entity Extraction

Named entity recognition (NER) is the task of tagging entities in text with their corresponding type.

NER, also known as entity extraction, classifies the named entities present in a text into pre-defined categories such as “individuals”, “companies”, “places”, “organizations”, “cities”, “dates”, and “product terminologies”.

It adds a wealth of semantic knowledge to your content and helps you to promptly understand the subject of any given text.

Text Matching

Text matching is the task of finding out how similar two documents are.

There are generally two ways to perform this task:

(a) Edit distance: also called Levenshtein distance.

It computes the edit distance between two words/strings.

The algorithm is based on dynamic programming.

Edit distance formula: the distances are filled into a matrix cell by cell.

If the two characters match, simply take the diagonal element of the matrix and place it in the current cell.

If the characters do not match, find the minimum of the left, top, and diagonal cells, add 1 to it, and place the result in the current cell.
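The matrix-filling procedure can be sketched in plain Python as follows. This is a straightforward version of the standard Levenshtein dynamic program, using the left, top, and diagonal neighbours of each cell; the function name edit_distance is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string (or back) costs one edit per character.
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                # Characters match: copy the diagonal cell.
                dist[i][j] = dist[i - 1][j - 1]
            else:
                # Otherwise: 1 + the cheapest of deletion, insertion, substitution.
                dist[i][j] = 1 + min(
                    dist[i - 1][j],      # top
                    dist[i][j - 1],      # left
                    dist[i - 1][j - 1],  # diagonal
                )

    return dist[-1][-1]  # last cell of the matrix


print(edit_distance("strength", "trend"))  # 4
```

For ‘strength’ and ‘trend’ the four edits are: delete the leading s, substitute g with d, and delete the trailing t and h.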

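In the edit distance matrix, (i, j) denotes the current cell, (i-1, j-1) the diagonal cell, (i-1, j) the cell above, and (i, j-1) the cell to the left.

Figure: Edit distance matrix for the strings ‘strength’ and ‘trend’.

The last cell in the matrix gives the edit distance, i.e. the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.

(b) Cosine Similarity: Cosine similarity calculates similarity by measuring the cosine of the angle between two vectors.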

The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together.

The smaller the angle, the higher the cosine similarity.
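A minimal sketch of cosine similarity between two documents, assuming each document is represented by raw word counts after whitespace tokenization. In practice TF-IDF weights or embeddings are often used in place of raw counts; the function name is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two documents."""
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())

    # Dot product over the words the two documents share
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)


short_doc = "machine learning"
long_doc = "machine learning machine learning machine learning"
print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity("cat dog", "fish bird"))
```

The short and the long document point in the same direction, so their similarity is 1 (up to floating-point rounding) despite the difference in length, while documents with no words in common score 0.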

Cosine similarity formula: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the two document vectors.

References:

(1) Tutorial on Regular Expression: https://www. [...] html
(2) Stop words in 16 languages: https://github.com/mitmedialab/DataBasic/tree/master/nltk_data/corpora/stopwords
(3) Stemming and Lemmatization: https://www. [...] com/community/tutorials/stemming-lemmatization-python
(4) Stemming and Lemmatization: https://towardsdatascience.com/stemming-lemmatization-what-ba782b7c0bd8
(5) https://blog. [...] com/what-is-the-difference-between-stemming-and-lemmatization/
(6) Collocations: https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a
(7) POS Tagging: https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb
(8) Alphabetical list of part-of-speech tags used in the Penn Treebank Project
(9) Dependency parsing: https://shirishkadam.com/2016/12/23/dependency-parsing-in-nlp/
