The Data Science Behind Natural Language Processing

What is natural language processing and how does it work?

Natural language processing (NLP) is a discipline in computer science and artificial intelligence.

NLP deals with communication between people and machines: interpreting what we mean and constructing valid responses.

The field has been around since the 1950s and you may have heard of the “Turing Test” developed by Alan Turing.

The Turing Test measures how well a computer responds to questions written by a human.

If an independent person cannot tell the difference between a person and a machine, then that computing system is considered intelligent.

We have come a long way since the 1950s and there have been many advances in the fields of data science and linguistics.

The remainder of this article will detail some of the basic capabilities of these algorithms in the field of natural language processing.

We will include some code examples using Python.

Tokenization

To get started in natural language processing, we will begin with some very simple text parsing.

Tokenization is the process of taking a stream of text like a sentence and breaking it down to its most basic words.

For instance, take the following sentence: “The red fox jumps over the moon.”

Each word represents a token, of which there are seven.
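As a rough sketch of the idea, the sentence can be broken into word tokens with a regular expression from the Python standard library. The `tokenize` helper and the `\w+` pattern are choices made for this illustration, not a standard tokenizer, and they simply drop punctuation.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    # \w+ matches runs of letters, digits, and underscores, so an apostrophe
    # would split a contraction such as "don't" into two tokens.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("The red fox jumps over the moon.")
print(tokens)       # ['the', 'red', 'fox', 'jumps', 'over', 'the', 'moon']
print(len(tokens))  # 7
```

Real tokenizers usually keep punctuation as separate tokens and handle contractions, which this pattern does not attempt.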

To tokenize a sentence using Python:

    myText = 'The red fox jumps over the moon.'
    myLowerText = myText.lower()
    myTextList = myLowerText.split()
    print(myTextList)

OUTPUT:

    ['the', 'red', 'fox', 'jumps', 'over', 'the', 'moon.']

Note that split() only breaks on whitespace, so the final period stays attached to the last token.

Parts of Speech

Part-of-speech tagging is used to determine a word's syntactic function.

In the English language the main parts of speech are: adjective, pronoun, noun, verb, adverb, preposition, conjunction, and interjection.

This is used to infer the intent of the word based on its use.

For example the word PERMIT can be a noun and a verb.
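As a toy illustration of inferring a word's role from how it is used, the sketch below guesses whether "permit" is a noun or a verb by looking only at the word before it. The rule, the word lists, the `guess_tag` helper, and the two sample sentences are all assumptions made up for this example; real taggers rely on far richer context.

```python
# Hypothetical word lists for this illustration only.
DETERMINERS = {"a", "an", "the", "this", "that"}
PRONOUNS = {"i", "you", "we", "they"}

def guess_tag(words, index):
    """Guess 'noun' or 'verb' for words[index] from the word before it."""
    previous = words[index - 1].lower() if index > 0 else ""
    if previous in DETERMINERS:
        return "noun"   # e.g. "the permit"
    if previous in PRONOUNS:
        return "verb"   # e.g. "they permit"
    return "unknown"    # this simple rule has nothing to go on

for sentence in ("They permit parking here", "She lost the permit"):
    words = sentence.split()
    index = [w.lower() for w in words].index("permit")
    print(sentence, "->", guess_tag(words, index))
```

A single-word lookbehind like this breaks down quickly, which is one reason libraries use models trained on tagged text instead.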

Verb use: “I permit you to go to the dance.”

Noun use: “Did you get the permit from the county?”

To tag parts of speech using Python (with the NLTK library):

You may have to install NLTK, which is a Python library for natural language processing. Installation instructions are available on the NLTK website. You may also need to download its tokenizer and tagger data with nltk.download(); the resource names depend on the NLTK version.

    import nltk

    myText = nltk.word_tokenize('the red fox jumps over the moon.')
    print('Parts of Speech:', nltk.pos_tag(myText))

OUTPUT:

    Parts of Speech: [('the', 'DT'), ('red', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('moon', 'NN'), ('.', '.')]

So you can see how NLTK breaks a sentence into tokens and assigns each one a part of speech. For instance, ('fox', 'NN') means that 'fox' is tagged NN: noun, singular. Note that 'jumps' is tagged NNS (plural noun) here even though it acts as a verb in this sentence; automatic taggers are not always right.

Stop Word Removal

Many sentences and paragraphs include words that have very little meaning or value.

These words include “a,” “and,” “an,” and “the.”

Stop word removal is the process of removing these words from a sentence or stream of words.
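The idea can be sketched without any library by filtering tokens against a small set. The `STOP_WORDS` set below contains only the four words named above and the `remove_stop_words` helper is hypothetical; real stop word lists are much longer.

```python
# A tiny stop list for illustration; real lists contain many more words.
STOP_WORDS = {"a", "and", "an", "the"}

def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS, preserving their order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "red", "fox", "jumps", "over", "the", "moon"]
print(remove_stop_words(tokens))  # ['red', 'fox', 'jumps', 'over', 'moon']
```

Using a set keeps each membership check fast, which matters once the stop list and the text grow.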

To perform stop word removal using Python and NLTK:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    example_sent = "a red fox is an animal that is able to jump over the moon."

    # Requires the NLTK 'stopwords' corpus: nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(example_sent)

    # Keep only the tokens that are not stop words.
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    print(filtered_sentence)

OUTPUT:

    ['red', 'fox', 'animal', 'able', 'jump', 'moon', '.']

Stemming

Stemming is the process of reducing noise in a word and is otherwise referred to as lexicon normalization.

It reduces inflection.

For example, the word “fishing” has the stem “fish.”

Stemming is used to simplify a word down to its base meaning.

Another good example is the word “like,” which is the stem of many words such as “likes,” “liked,” and “likely.”

Search engines use stemming for this reason.

In many situations it could be useful for a search for one of these words to return documents that contain another word in the set.
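A minimal sketch of that search behavior follows: documents are indexed by stem, so a query for one word form finds documents containing another. The `toy_stem` function is a deliberately crude suffix stripper written for this example (it is not the Porter algorithm), and the three documents are made up.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude suffix stripping for illustration only; not the Porter stemmer."""
    word = word.lower()
    for suffix in ("ing", "ly", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Drop a trailing "e" so "like" and "liked" share the stem "lik".
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word

def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents sharing the query's stem."""
    return sorted(index.get(toy_stem(query), set()))

documents = {1: "she likes hiking", 2: "rain is likely today", 3: "the fox jumps"}
index = build_index(documents)
print(search(index, "liked"))    # [1, 2]
print(search(index, "jumping"))  # [3]
```

The stem "lik" is not a readable word, which is fine here because it is only used as a lookup key.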

To perform stemming using Python and the NLTK library:

    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    words = ["likes", "likely", "likes", "liking"]
    for w in words:
        print(w, ":", ps.stem(w))

OUTPUT:

    likes : like
    likely : like
    likes : like
    liking : like

Lemmatization

Stemming and lemmatization are very similar in that they enable you to get to the root word.

This is called word normalization and both can generate the same output.

However, they work very differently.

Stemming simply chops the ends off words, whereas lemmatization takes into account whether the word is a noun, a verb, or another part of speech.

Let’s take the word “saw.”

Stemming will bring back “saw,” and lemmatization could bring back “see” or “saw,” depending on how the word is used.

Lemmatization usually brings back a readable word where stemming may not.
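One way to picture the difference is to treat lemmatization as a dictionary lookup keyed by both the word and its part of speech. The `LEMMAS` table and `toy_lemmatize` helper below are hypothetical and cover only a few entries; a real lemmatizer draws on a full lexical database.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma.
LEMMAS = {
    ("saw", "v"): "see",      # verb: past tense of "see"
    ("saw", "n"): "saw",      # noun: the cutting tool
    ("better", "a"): "good",  # adjective: comparative of "good"
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a word and POS; fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(toy_lemmatize("saw", "v"))  # see
print(toy_lemmatize("saw", "n"))  # saw
```

Because the part of speech is part of the key, the same spelling can map to different lemmas, which plain suffix stripping cannot do.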

Let’s take a look at a Python example that compares stemming to lemmatization:

    from nltk.stem import PorterStemmer
    from nltk.stem import WordNetLemmatizer

    # Requires the NLTK 'wordnet' corpus: nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()
    ps = PorterStemmer()
    words = ["corpora", "constructing", "better", "done", "worst", "pony"]
    for w in words:
        # pos='a' treats each word as an adjective, which matches the output below.
        print(w, "STEMMING :", ps.stem(w), "LEMMATIZATION", lemmatizer.lemmatize(w, pos='a'))

OUTPUT:

    corpora STEMMING : corpora LEMMATIZATION corpora
    constructing STEMMING : construct LEMMATIZATION constructing
    better STEMMING : better LEMMATIZATION good
    done STEMMING : done LEMMATIZATION done
    worst STEMMING : worst LEMMATIZATION bad
    pony STEMMING : poni LEMMATIZATION pony

Conclusion

Linguistics is the study of language, including morphology, syntax, phonetics, and semantics.

This field, together with data science and computing, has grown enormously over the past 60 years.

We just explored some very simple text analytic capabilities in NLP.

Google, Bing, and other search engines leverage this technology to help you find information on the world wide web.

Think of how easy it is to have Alexa play your favorite song or how Siri helps you with directions.

It is all because of NLP.

Natural language in computing is not a gimmick or toy.

NLP is the future of seamless computing in our lives.

Arcadia Data just released version 5.0, which includes our natural language query capabilities that we call Search Based BI.

It uses some of the data science and text analytics described above.

Check out this video on our Search Based BI tool to learn more: SEARCH-BASED BI.

ORIGINALLY POSTED: https://www.arcadiadata.com/blog/the-data-science-behind-natural-language-processing/
