Text Processing Is Coming

How to use Regular Expression (Regex) and the Natural Language Toolkit (NLTK) on Game of Thrones Book 1

Madeline McCombe · Jun 11

Photo by Bharat Patil on Unsplash

If you’re like me, and you’ve seen more memes about Game of Thrones (GOT) than actual episodes of the show, you might be wondering why everyone is so obsessed with it.

Since I don’t have time to watch the show or read the books, I’m going to use basic text processing to get a general understanding of what I’m missing.

In this article I’ll be using the regular expression and natural language toolkit packages in Python to explore, clean, tokenize, and visualize the text.

There is a part two, available here, that explores different methods of analyzing the words.

The text from all 5 books can be found on Kaggle here.

I will be using the text of the first book (A Game of Thrones, 1996), which has 571 pages containing 20,168 lines of text.

I will be explaining these concepts in order to clean the text: Regex syntax, Regex functions, Tokenization, Stemming/Lemmatization, Combining NLTK and Regex, and Visualizing Word Frequencies.

What is Regex?

Regular expression is a language of different symbols and syntax that can be used to search for a piece of string within a larger string.

It can be used in almost any coding language, and is very useful when trying to search for general string patterns.

Most often, it is used in web scraping, input validation, and simple parsing.

In Python the Regex package can be imported using import re.

This gives you access to many different functions and string sequences that allow you to search for anything you want to.

A regex string refers to the string of letters, symbols, or numbers that tells Regex what to look for.

For example, if you want to find all instances of ‘Daenerys’, the regex string would look like r‘Daenerys’.

However, if you want to find all words that start with ‘D’, the regex string would look like r‘D[a-z]+’.

The D at the beginning of the string must be matched exactly, the square brackets define a set of characters to pick from, and the + says to pick from that set one or more times to complete the word.
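For example, here is a minimal sketch of both patterns (the sentence is made up for illustration, not pulled from the book):

import re

sentence = 'Daenerys rode with Drogo across the Dothraki sea.'  # invented example sentence
print(re.findall(r'Daenerys', sentence))
# ['Daenerys']
print(re.findall(r'D[a-z]+', sentence))
# ['Daenerys', 'Drogo', 'Dothraki']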

I am going to go through some basic forms of regex strings that will be useful when searching through text.

r‘A*’ matches A 0 or more times (‘’, ‘A’, ‘AA’, ‘AAA’, ‘AAAA’, etc.)
r‘A+’ matches A 1 or more times (‘A’, ‘AA’, ‘AAA’, ‘AAAA’, etc.)
r‘[a-z]*’ matches any lowercase letter 0 or more times (‘’, ‘ajrk’, ‘bor’, ‘q’, etc.)
r‘[0-9]+’ matches any number 1 or more times (‘1’, ‘254’, ‘1029345’, etc.)
r‘ing$’ matches words that end with -ing (‘running’, ‘climbing’, ‘ing’, etc.)
r‘^st’ matches words that start with st- (‘string’, ‘stir’, ‘strive’, etc.)
r‘[^a]’ will match any string without an a (‘surprise’, ‘d4nfo.’, ‘interesting!’, etc.)
r‘.{4}’ matches any string of 4 characters without a newline (‘9?rf’, ‘(hi)’, etc.)

There are several special sequences in regex that consist of a backslash followed by a letter.

A backslash (\) is an escape character that can negate the traditional meaning of whatever follows it. A ‘w’ would normally match ‘w’, but r‘\w+’ matches one or more alphanumeric characters. If the letter is lowercase, then it matches everything that the special sequence defines, but if the letter is uppercase then the regex string matches everything except what it defines. A list of the special sequence letters can be found here.
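As a quick sketch of the lowercase/uppercase contrast (again on a made-up string):

import re

text = 'Winter is coming in 298 AC!'  # invented example string
print(re.findall(r'\w+', text))   # lowercase w: runs of alphanumeric characters
# ['Winter', 'is', 'coming', 'in', '298', 'AC']
print(re.findall(r'\d+', text))   # lowercase d: runs of digits
# ['298']
print(re.findall(r'\D+', text))   # uppercase D: everything that is not a digit
# ['Winter is coming in ', ' AC!']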

There are many more techniques than those I briefly covered, and you can go here, here, here, or here to see further information!

Searching Text using Regex

The re package has several built-in functions that can be used to apply a regex string to a body of text and find matches, among other things.

I will go over a few, and explain the differences between them.

All of these functions take at least two arguments: the pattern to match (regex string) and the text to search.

The following three functions return a match object, which consists of the index of the string matched (start and stop), as well as what string the function matched.

They are also limited to finding one match per query.

re.search: finds the first instance matching the pattern; returns a match object or None
re.match: finds an instance of the pattern only at the beginning of the string; returns a match object or None
re.fullmatch: finds whether the whole string matches the pattern given; returns a match object or None

print(re.search(r'q[a-zA-Z]+', 'There is a queen in the castle.'))
# <_sre.SRE_Match object; span=(11, 16), match='queen'>
print(re.match(r'[a-zA-Z]+', 'There is a queen in the castle.'))
# <_sre.SRE_Match object; span=(0, 5), match='There'>
print(re.fullmatch(r'.*', 'There is a queen in the castle.'))
# <_sre.SRE_Match object; span=(0, 31), match='There is a queen in the castle.'>

The next two functions find all matches of the pattern within a string.

Here, re.findall returns a list of all the matches, whereas re.finditer allows you to pull out specific information about each match using a loop.

re.findall: finds all non-overlapping matches to the pattern; returns a list of all matches
re.finditer: finds all non-overlapping matches to the pattern; returns an iterator object that can tell you the start/stop/contents of the match

print(re.findall(r'\w+ \w+', 'There is a queen in the castle.'))
# ['There is', 'a queen', 'in the']
print(re.finditer(r'dragon', got))
# <callable_iterator object at 0x7f062e51a400>
for m in re.finditer(r'\w+ \w+', 'There is a queen in the castle.'):
    print('Start:', m.start(), 'End:', m.end(), 'Text:', m.group())
# Start: 0 End: 8 Text: There is
# Start: 9 End: 16 Text: a queen
# Start: 17 End: 23 Text: in the

The following two functions are ways to split or modify a string after searching for a pattern.

Both return a new result rather than modifying the original string: re.sub returns a new string, and re.split returns a list of strings.

re.sub(pattern, replacement, string): replaces the pattern with a replacement string; returns the modified string
re.split: splits a string based on a pattern; returns a list of strings

print(re.sub(r'\w+', 'word', 'There is a queen in the castle.'))
# word word word word word word word.
print(re.split(r"[^a-zA-Z']+", "This is the queen's castle. So exciting!"))
# ['This', 'is', 'the', "queen's", 'castle', 'So', 'exciting', '']

Now that we’ve covered the basics of Regex, let’s move on to preprocessing the GOT text.

Tokenization

In order to analyze a text, its words must be pulled out and analyzed.

One way to do this is to split each text by spaces so that individual words are returned.

However, this doesn’t take into account punctuation or other symbols that you might want to remove.

This process of breaking sentences, paragraphs, or chapters into individual words is called tokenization, and is an essential step before any type of text analysis is performed.
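To see why splitting on spaces alone falls short, here is a tiny sketch on an invented sentence:

sentence = "The king's tourney begins today."  # invented example sentence
print(sentence.split())
# ['The', "king's", 'tourney', 'begins', 'today.']  <- the period stays glued to 'today'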

Luckily, there is a package in Python called the Natural Language Toolkit that has a ton of useful functions to manipulate text.

It can be imported using import nltk.

This package includes a word tokenizer and a sentence tokenizer, which breaks the text down into words and sentences respectively.

The word tokenizer breaks text into words, punctuation, and any miscellaneous characters.

This means that punctuation detaches itself from the word and becomes its own element in the list.

The sentence tokenizer breaks text at traditional sentence punctuation (., ?, !, etc.) and keeps the punctuation attached to the sentence.

Here is an example of each:

from nltk.tokenize import word_tokenize, sent_tokenize
print(word_tokenize("This is the queen's castle. Yay!"))
# ['This', 'is', 'the', 'queen', "'s", 'castle', '.', 'Yay', '!']
print(sent_tokenize(got)[1:3])
# ['"The wildlings are?.dead."', '"Do the dead frighten you?"']

So now that you have a list of all the words and punctuation in the text, what next? The list of tokens can be run through a loop, and everything that is in a list of stopwords can be removed.

Stopwords are words that occur too frequently or have very little meaning, and should be removed.

This can be thought of as a type of dimension reduction, as you are taking away words that will not allow you to glean information about the text.

It can also be useful to remove words that occur too infrequently.
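Filtering out rare words is not part of this walkthrough, but as a rough sketch (the threshold of 2 is arbitrary, and the token list is assumed to be something like the filtered_words built below):

from collections import Counter

counts = Counter(filtered_words)                              # count how often each token appears
common_words = [w for w in filtered_words if counts[w] >= 2]  # drop tokens that appear only once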

from nltk.corpus import stopwords
import random

stop_words = stopwords.words("english")
print(random.sample(stop_words, 8))
print('There are', len(stop_words), 'English stopwords.')
# ['now', 'about', 'to', 'too', 'himself', 'were', 'some', "you'll"]
# There are 179 English stopwords.

import string
punct = list(string.punctuation)
print(punct[0:13])
print('There are', len(punct), 'punctuation marks.')
# ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-']
# There are 32 punctuation marks.

stops = stop_words + punct + ["''", 'r.', '“', "'s", "n't"]
filtered_words = []
for w in got_words:
    if w.lower() not in stops:
        filtered_words.append(w.lower())
print(filtered_words[0:8])
# ['game', 'thrones', 'book', 'one', 'song', 'ice', 'fire', 'george']

With this text, each page number of the book is specified as ‘Page X’.

In my current list of cleaned words each instance of this is shown as [‘page’, ‘#’], and I will deal with that in my next article when I do further text analysis.
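For illustration only, since the article handles this in part two, one possible way to drop those page markers would be to filter out the 'page' token and purely numeric tokens:

import re

no_pages = [w for w in filtered_words if w != 'page' and not re.fullmatch(r'\d+', w)]  # hypothetical cleanup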

Stemming and Lemmatization

If either of those words sounds like a weird form of gardening, I totally get it.

However, these are actually two techniques used to combine all variants of a word into its parent form.

For example, if a text has ‘running’, ‘runs’, and ‘run’, those are all forms of the parent word ‘run’, and they should be transformed and counted as the same word since they have the same meaning.

Going through the text line by line and trying to figure out if each word should be transformed to its base form is computationally intensive and a waste of time.

Luckily, the nltk package introduced in the previous section has functions that can do this for you!

Stemming removes the end of a word (-ing, -ed, -s, or another common ending) in the hopes that it will find the ‘base’ form of a word. This method works well for words like ‘running’ to ‘run’, ‘climbing’ to ‘climb’, and ‘pouring’ to ‘pour’, but doesn’t work for other words, such as ‘leaves’, which becomes ‘leav’.

Here is a simple example of this in action:

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
stemmed_words = []
for w in filtered_words:
    stemmed_words.append(ps.stem(w))
print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Stemmed:', stemmed_words[7], stemmed_words[13], stemmed_words[15], stemmed_words[26])
# Stemmed: georg urg began ask

This method doesn’t quite get all the words correctly transformed: george is changed to georg.

To fix this, lemmatization can be used instead of stemming. It achieves the same effect but uses a dictionary of lemmas (the base forms of words) to figure out whether truncating the end of a word makes sense.

It also takes into account the type of word (noun, verb, adjective) to better guess at the parent.

This method allows ‘leaves’ to transform to ‘leaf’, rather than ‘leav’.

from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
lemm_words = []
for w in filtered_words:
    lemm_words.append(lem.lemmatize(w, 'v'))
print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Lemmatized:', lemm_words[7], lemm_words[13], lemm_words[15], lemm_words[26])
# Lemmatized: george urge begin ask

Notice how lemmatization correctly transforms urged, began, and asked to urge, begin, and ask because it treats all of the tokens as verbs and searches for the base form.

It also ignores all words that do not need to be transformed.
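To make the earlier ‘leaves’ example concrete, here is a quick side-by-side sketch; note that the lemmatizer is called with the noun part of speech ('n') here, unlike the verb setting used above:

print(ps.stem('leaves'))             # the Porter stemmer just chops the ending
# leav
print(lem.lemmatize('leaves', 'n'))  # the lemmatizer looks up the lemma of the noun
# leaf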

A useful guide on how to do stemming and lemmatization in Python can be found here.

Combining NLTK and Regex

Now that you’ve learned a little about Regex and a little about what NLTK has to offer, I am going to explain the intersection of the two.

When tokenizing a text, it is possible to split on something other than the nltk default. This is done by using nltk.tokenize.RegexpTokenizer(pattern).tokenize(text), where you can specify what regex string to split on.

This is similar to re.split(pattern, text), but the pattern specified in the NLTK function is the pattern of the token you would like it to return, instead of what will be removed and split on.

There are also a bunch of other tokenizers built into NLTK that you can peruse here.

Here are some examples of nltk.tokenize.RegexpTokenizer():

from nltk.tokenize import RegexpTokenizer
print(RegexpTokenizer(r'\w+').tokenize("This is the queen's castle. So exciting!"))
# ['This', 'is', 'the', 'queen', 's', 'castle', 'So', 'exciting']

There is also an easy way to implement some of the functions covered previously, especially re.search.

This can be used to double check that the stemming/lemmatization did what was expected.

In this case we can see that verbs ending in -ing were removed from the list of words during lemmatization.

words_ending_with_ing = [w for w in got_words if re.search("ing$", w)]
print('Tokens:', words_ending_with_ing[3:8])
# Tokens: ['falling', 'being', 'something', 'rushing', 'Something']
words_ending_with_ing2 = [w for w in lemm_words if re.search("ing$", w)]
print('Lemmatized:', words_ending_with_ing2[3:7])
# Lemmatized: ['something', 'something', 'wildling', 'something']

Another useful way to find instances of phrases within a list of tokens is to first turn the list into a text object using nltk.Text(list), and then subset that object using text.findall(r‘<>’), where each <> holds the regex string to match one token in a sequence. I will go through some examples below, and there is a helpful reference here for further exploration.

got_text = nltk.Text(lemm_words)
print(got_text)
# <Text: game throne book one song ice fire george...>
print(got_text.findall(r'<.*><daenerys><.*>'))
# hide daenerys brother; usurper daenerys quicken; day daenerys
# want; hot daenerys flinch; archer daenerys say; princess daenerys
# magister; hand daenerys find; help daenerys collar; sister
# daenerys stormborn; ... drogo daenerys targaryen

This is a pretty cool way to figure out what mini phrases your text might have hidden, and it is a good place to start analyzing a text. In particular, following Daenerys through the whole book gives a very quick summary of her character arc.

Next, I am going to introduce different ways to visualize the frequencies of words within a list of tokens.

Visualizing Word Frequencies

A handy way to get a grasp of the text before actually analyzing it is to look at what words occur most frequently.

Doing this is also a good way to make sure that you have removed all of the necessary stopwords.

A basic way to visualize words and their relative frequency is a wordcloud, and a great walkthrough of this can be found here.
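If you want to try it, a minimal sketch using the third-party wordcloud package (not used elsewhere in this article) could look like this:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(background_color='white').generate(' '.join(lemm_words))  # build the cloud from the cleaned tokens
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()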

This method only shows the relative frequencies of each word as compared to the others, and can be difficult to interpret given the abstract nature.

To remedy this, I am going to explore different methods of visualization included in the nltk package, including lexical dispersion plots, frequency distribution plots, and n-gram frequency distribution plots.

A Frequency Distribution plot plots the words according to frequency.

Plot twist, I know.

This is useful because it shows how much of the text is made up of different themes or ideas.

The more common a word is, the more central it is to the theme of a text.

import matplotlib.pyplot as plt

freqdist = nltk.FreqDist(got_words)
plt.figure(figsize=(16,5))
plt.title('GOT Base Tokens Frequency Distribution')
freqdist.plot(50)

(Figures: Original Token Frequency Distribution; Lemmatized Frequency Distribution)

I included the original tokens in one distribution and the filtered, lemmatized words in another distribution to highlight the importance of removing stopwords and other non-essential symbols. The first distribution tells me nothing about the book, whereas the second gives me several words and character names that capture the essence and writing style of George R. R. Martin.

Based on the findings of the lemmatized distribution, I chose 10 words to graph in a lexical dispersion plot.

A lexical dispersion plot shows the distribution of a word relative to where it shows up in the text (the offset).

For example, if the word of interest was ‘fox’ and the sentence was, ‘The quick brown fox jumps over the lazy dog’, the offset of the word would be four, since it is the fourth word in the sentence.

This technique of plotting a word and its offset shows themes over time.

It is also a very customizable plot, since you can choose the words to visualize.

from yellowbrick.text import DispersionPlot

topics = ['dragon', 'horse', 'death', 'father', 'king', 'brother', 'eye', 'hand', 'lord', 'ser']
plt.figure(figsize=(16,5))
visualizer = DispersionPlot(topics)
visualizer.fit([lemm_words])
visualizer.poof()

(Figure: Lexical Dispersion Plot)

As you can see, there are different themes in the book at different times.

Further analysis of different keywords would uncover different meanings, but this is a good start to see what concepts are important in the book.
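The n-gram frequency distributions mentioned above are not plotted here, but a minimal sketch for bigrams follows the same FreqDist pattern:

bigram_freq = nltk.FreqDist(nltk.bigrams(lemm_words))  # frequency of adjacent token pairs
plt.figure(figsize=(16,5))
bigram_freq.plot(20)  # plot the 20 most common bigrams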

Conclusion

In this article we introduced Regex and different methods of searching through text, then went through basic tools in the NLTK package and saw how Regex can be combined with NLTK.

At the end we drew broad summaries from the book, and prepared the text for further analysis.

I hope you enjoyed this journey, and I hope you will explore part two (it should be up in a day or so)! A copy of my code, which has further examples and explanation, can be found here on GitHub. Feel free to take and use the code as you please.

Further resources

https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk (great source for general text processing)
https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python (a course in NLP)
