A Game of Words: Vectorization, Tagging, and Sentiment Analysis

With a Bag of Words array, you can run logistic regression or another classification algorithm to show which documents (rows) within the array are most similar.

This can be helpful when trying to see if two articles are related in topic.
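To make this concrete, here is a minimal sketch (with a made-up docs list, not code from earlier in this series) of building a Bag of Words matrix with sklearn and comparing the rows with cosine similarity:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# docs is a hypothetical list of documents; rows of the resulting matrix are documents,
# columns are word counts
docs = ['The dead frighten me.', 'We have no business with the dead.', 'The sky turned a deep purple.']
bow = CountVectorizer().fit_transform(docs)
print(cosine_similarity(bow))  # pairwise similarity between every pair of documents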

Skip-Thought Vectors and Word2Vec both cluster pieces of text based on meaning (Word2Vec at the word level, Skip-Thought at the sentence level), a technique called word embedding.

This technique is important because it preserves relationships among words.

Especially when dealing with review text data (anything with a numerical rating accompanying the text review), these techniques can yield valuable insights about what the consumers are feeling and thinking.
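For example, a minimal Word2Vec sketch with gensim (assuming gensim 4.x and a hypothetical tokenized_sentences list of token lists, which is not something built in this article) could look like:

from gensim.models import Word2Vec

# tokenized_sentences: hypothetical list of token lists, e.g. [['dead', 'is', 'dead'], ...]
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=2)
print(model.wv.most_similar('dead', topn=5))  # words that appear in similar contexts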

Since A Game of Thrones does not come with labels for classification, I have no way of validating a model, so I am going to explain alternative methods of analyzing a text below.

POS tagging

Part of Speech (POS) tagging is the process of assigning a part of speech to each word in a list using context clues.

This is useful because the same word with a different part of speech can have two completely different meanings.

For example, given the two sentences ‘A plane can fly’ and ‘There is a fly in the room’, it would be important to tag each ‘fly’ correctly (verb versus noun) in order to determine how the two sentences are related (that is, not at all).

Tagging words by part of speech allows you to do chunking and chinking, which are explained later.

An important note is that POS tagging should be done immediately after tokenization and before any words are removed, so that sentence structure is preserved and it is more obvious which part of speech each word belongs to.

One way to do this is by using nltk.pos_tag():

import nltk

document = ' '.join(got1[8:10])

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

sent = preprocess(document)
print(document)
print(sent)

['“Dead is dead,” he said. “We have no business with the dead.” ', '“Are they dead?” Royce asked softly. “What proof have we?” ']
[..., ('“', '“'), ('We', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('business', 'NN'), ('with', 'IN'), ('the', 'DT'), ('dead', 'JJ'), ('.', '.'), ("''", "''"), ...]

Here is a snippet of what was created above, and you can see that adjectives are represented as ‘JJ’, nouns as ‘NN’, and so on.

This information will be used when chunking later.

Named Entity Recognition

Sometimes it is helpful to further define the parts of speech for special words, especially when trying to process articles about current events.

Beyond being nouns, ‘London’, ‘Paris’, ‘Moscow’, and ‘Sydney’ are all locations that have specific meaning attached to them.

The same goes for names of people, organizations, times, money, percentages, and dates, among other things.

This process is important in text analysis because it offers a quick way to understand what a chunk of text is actually about.

Generally, to apply NER to a text, tokenization and POS tagging must have been performed previously.

The nltk package has two methods to do NER built in, both of which are explained well in this article.
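As a quick sketch of the built-in route (the example sentence is made up, and the nltk data packages such as 'maxent_ne_chunker' and 'words' may need to be downloaded first):

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Ser Waymar Royce rode north from Winterfell."  # hypothetical input
tree = ne_chunk(pos_tag(word_tokenize(sentence)))  # tokenization and POS tagging happen first
print(tree)  # prints a tree with entity labels such as PERSON or GPE attached to chunks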

Another useful way to perform NER and have the capability to visualize and sort the results is through the spaCy package.

A good walkthrough of this can be found here.

I explored the GOT text using this method, and had some interesting results:

import spacy
import en_core_web_sm
from collections import Counter
from pprint import pprint

nlp = en_core_web_sm.load()
doc = nlp(document3)
pprint([(X.text, X.label_) for X in doc.ents])

[('George R. R. Martin ', 'PERSON'),
 ('Ser Waymar Royce', 'PERSON'),
 ('fifty', 'CARDINAL'),
 ('Will', 'PERSON'),
 ('Royce', 'PERSON'),
 ('Eight days', 'DATE'),
 ('nine', 'CARDINAL'),
 ('Waymar Royce', 'PERSON'),
 ('Gared', 'PERSON'),
 ('Gared', 'ORG'),
 ('forty years', 'DATE'),
 ...]

In the code above, document3 is the full text of A Game of Thrones in a single string.

This package efficiently found and classified all types of entities.

It was a bit confused on some instances of Gared (at one point it classified him as PERSON, another as ORG, and another later on as WORK_OF_ART).

However, overall this gave more insight into the content of the text than just POS tagging did.
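The visualization mentioned above can be produced with spaCy's displacy module; here is a minimal sketch using the doc object created earlier (rendering inline assumes a Jupyter notebook):

from spacy import displacy

# highlights each recognized entity in the text with its label;
# from a plain script, displacy.serve(doc, style='ent') starts a local viewer instead
displacy.render(doc, style='ent', jupyter=True)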

A count of how many matches per type of entity and the top entities found is below.

Unsurprisingly, there were a lot of names found in the text.

labels = [x.label_ for x in doc.ents]
items = [x.text for x in doc.ents]
print(Counter(labels))
print(Counter(items).most_common(5))

Counter({'CARDINAL': 340, 'DATE': 169, 'FAC': 34, 'GPE': 195, 'LAW': 2, 'LOC': 24, 'MONEY': 1, 'NORP': 32, 'ORDINAL': 88, 'ORG': 386, 'PERSON': 2307, 'PRODUCT': 35, 'QUANTITY': 23, 'TIME': 86, 'WORK_OF_ART': 77})
[('Jon', 259), ('Ned', 247), ('Arya', 145), ('Robert', 132), ('Catelyn', 128)]

Chunking and Chinking

Chunking and chinking are two methods used to extract meaningful phrases from a text.

They combine POS tagging and Regex to produce text snippets that match the phrase structures requested.

One implementation of chunking is to find phrases that provide descriptions of different nouns, called noun phrase chunking.

A noun phrase chunk generally consists of a determiner/possessive, adjectives, a possible verb, and the noun.

If you find that your chunks have parts that you do not want, or that you’d rather split the text on a specific part of speech, an easy way to achieve your goal is by chinking.

This defines a small chunk (called a chink) that should be removed or split on when chunking.

I am not going to explore chinking in this article, but a tutorial can be found here.
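For reference, the basic shape of a chinking grammar looks something like the sketch below (an illustrative grammar of my own, using the nltk.RegexpParser introduced next and the POS-tagged sent created in the POS tagging section):

# {} braces chunk; }{ braces chink (remove) matches back out of the chunks
grammar = r"""
  NP:
    {<.*>+}          # chunk every tagged word
    }<VB.*|IN>+{     # then chink sequences of verbs and prepositions
"""
chink_parser = nltk.RegexpParser(grammar)
print(chink_parser.parse(sent))  # sent: the POS-tagged tokens from earlier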

The easiest way to do specific types of chunking with NLTK is using nltk.RegexpParser(r'<><><>'). This allows you to specify your noun phrase formula, and is very easy to interpret. Each <> references the part of speech of one word to match, and normal regex syntax applies within each <>.

This is very similar to the nltk.Text().findall(r'<><><>') concept, but with POS tags instead of actual words.

A couple of things to note when creating the regex string to parse are that the part of speech abbreviations (NN = noun, JJ = adjective, PRP = personal pronoun, etc.) can vary between packages, and that it is often good to start more specific and then broaden your search.

If you’re super lost right now, a good intro to this concept can be found here.

Also, it may be a good idea to brush up on sentence structures and parts of speech beforehand so that you can fully interpret what the chunking returns.

Here is an example of this applied to GOT:

document2 = ' '.join(got1[100:300])
big_sent = preprocess(document2)  # POS tagging the words
pattern = 'NP: {<DT>?<JJ>*<NN.?>+}'
cp = nltk.RegexpParser(pattern)
cs = cp.parse(big_sent)
print(cs)

(..., (NP Twilight/NNP) deepened/VBD ./.
 (NP The/DT cloudless/NN sky/NN) turned/VBD (NP a/DT deep/JJ purple/NN) ,/,
 (NP the/DT color/NN) of/IN (NP an/DT old/JJ bruise/NN) ,/, ...)

This is a very similar idea to NER, as you can group NN or NNP (nouns or proper nouns) together to find full names of objects.

Also, the pattern to match can be any combination of parts of speech, which is useful when looking for certain kinds of phrases.

However, if the POS tagging is incorrect, you will not be able to find the types of phrases you are looking for.

I only looked for noun phrases here, but there are more types of chunks included in my github code.
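As an illustration (these grammars are sketches of my own, not necessarily the ones in the repo), other phrase types can be requested the same way:

# hypothetical additional patterns: prepositional phrases and simple verb phrases
pp_pattern = 'PP: {<IN><DT>?<JJ>*<NN.?>+}'  # preposition followed by a noun phrase
vp_pattern = 'VP: {<VB.?><RB.?>?<VB.?>?}'   # verb, optional adverb, optional second verb
print(nltk.RegexpParser(pp_pattern).parse(big_sent))
print(nltk.RegexpParser(vp_pattern).parse(big_sent))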

Sentiment Analysis

Sentiment Analysis is how a computer combines everything covered so far and comes up with a way to communicate the overall gist of a passage.

It compares the words in a sentence, paragraph, or another subset of text to a list of words in a dictionary and calculates a sentiment score based on how the individual words in a sentence are categorized.

This is mostly used in analyzing reviews, articles, or other opinion pieces, but I am going to apply this to GOT today.

I am mainly interested in seeing if the overall tone of the book is positive or negative, and if that tone varies between chapters.

There are two ways of doing sentiment analysis: you can train and test a model on previously categorized text and then use that to predict whether new text of the same type will be positive or negative, or you can simply use an existing lexicon built into the function that will analyze and report a positive or negative score.
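The first (supervised) route is not an option for GOT since there are no labels, but as a rough sketch on made-up review data it would look something like this; the lexicon route is what I actually use below.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ['loved it', 'terrible and boring', 'great story', 'waste of time']  # made-up data
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(['boring story']))  # predicts a 0/1 label for unseen text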

Here is an example of the latter on some sentences from the first page of A Game of Thrones:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()
for sentence in sentences:  # sentences: a list of sentences from the first page of the book
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

…
“Do the dead frighten you?”
compound: -0.7717, neg: 0.691, neu: 0.309, pos: 0.0,
Ser Waymar Royce asked with just the hint of a smile.
compound: 0.3612, neg: 0.0, neu: 0.783, pos: 0.217,
Gared did not rise to the bait.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0,
…

Since this is analyzing the text of a book and not the text of reviews, a lot of the sentences are going to have a neutral compound score (0).

This is totally fine for my purposes however, because I am just looking for general trends in the language of the book over time.

But it is still nice to see that when the dead are mentioned, a negative score is applied.

TextBlob is another useful package that can perform sentiment analysis.

Once you turn your text into a TextBlob object (textblob.TextBlob()), it has functions to tokenize, lemmatize, and tag plain text, and it can work with WordNet, which quantifies the similarity between words.

There are a lot of different text objects specific to this package that allow for really cool transformations, explained here.

There is even a correct() function that will attempt to correct spelling mistakes.

I am not going to go into most of these in this article, as I am trying to analyze a book which should generally have correct spelling and syntax, however many of these tools would be useful when dealing with particularly messy text data.
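A minimal sketch of a few of these utilities (the input string is made up):

from textblob import TextBlob

blob = TextBlob('The knigt rode north towards the Wall')
print(blob.words)      # tokenized words
print(blob.tags)       # POS tags, similar to nltk.pos_tag
print(blob.correct())  # attempts to fix spelling mistakes such as 'knigt'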

Here is TextBlob’s version of sentiment analysis on the first page of A Game of Thrones:

from textblob import TextBlob

def detect_polarity(text):
    return TextBlob(text).sentiment

for sentence in sentences:
    print(sentence)
    print(detect_polarity(sentence))

“Do the dead frighten you?”
Sentiment(polarity=-0.2, subjectivity=0.4)
Ser Waymar Royce asked with just the hint of a smile.
Sentiment(polarity=0.3, subjectivity=0.1)
Gared did not rise to the bait.
Sentiment(polarity=0.0, subjectivity=0.0)

There is similarity between the sentiment scores of nltk and textblob, but the nltk version has more variability since it is a compound score.

The textblob sentiments, on the other hand, include a subjectivity score, which gives a sense of how reliably a sentence can be classified.

Below is a distribution of the sentiments by page per method.

Textblob overall gave higher sentiment ratings, whereas nltk had more variance with the score.
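The per-page scores behind that comparison can be computed along these lines (a sketch assuming a hypothetical pages list of page-sized strings; my github code may split the text differently):

from textblob import TextBlob  # already imported above

# score each page with both methods so the two distributions can be compared
page_scores = []
for page in pages:  # pages: assumed list of page-length strings
    vader_compound = sid.polarity_scores(page)['compound']  # nltk VADER analyzer from above
    blob_polarity = TextBlob(page).sentiment.polarity
    page_scores.append((vader_compound, blob_polarity))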

If you are trying to gather sentiment from social media text or emojis, the VADER Sentiment Analysis is a tool specifically curated for that task.

It has built-in slang (lol, omg, nah, meh, etc.) and can even understand emojis.

A good walkthrough of how to use it can be found here.
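A minimal sketch with the standalone vaderSentiment package (the example sentence is made up):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# the slang and the emoji both contribute to the neg/neu/pos/compound scores
print(analyzer.polarity_scores('omg the finale was lit 🔥 but that ending was meh'))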

Also, if Python is not your go-to language for text analysis, there are methods in other languages/software to do sentiment analysis that are explained here.

Other NLP packages

I only explained functions from the nltk, textblob, vaderSentiment, spacy, and sklearn packages in this article, but each has advantages and disadvantages depending on the task you’re trying to accomplish.

Some others that may be better suited to your task are Polyglot and Gensim.

Polyglot is known for having the ability to analyze a large number of languages (supports 16–196 depending on the task).

Gensim is primarily used for unsupervised learning tasks on text, and it needs any preprocessing to be done with a different package.
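For example, a rough gensim topic-modeling sketch (assuming a hypothetical texts list of already-tokenized, preprocessed documents) looks like this:

from gensim import corpora, models

# texts: assumed list of token lists preprocessed elsewhere (e.g., with nltk)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]
lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary)
print(lda.print_topics())  # the top words in each discovered topic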

You can find a handy chart with all this information here.

Conclusion

One key thing that I’ve learned from writing this article is that there are always at least three ways to accomplish a single task, and determining the best option just depends on what kind of data you are using.

Sometimes you are going to prioritize computation time, and other times you will need a package that can do unsupervised learning well.

Text processing is a fascinating science, and I cannot wait to see where it leads us in the next few years.

In this article I covered vectorization and how it can determine similarity between texts, tagging, which allows meaning to be attached to words, and sentiment analysis, which tells roughly how positive or negative a text is.

I have gleaned many insights from Game of Thrones, like that there is a lot of death, that ‘sir’ is a common title spelled ‘Ser’, and that there are not as many instances of dragons as I was led to believe.

However, I may be convinced to read the books now! I hope you enjoyed the article! A copy of my code, which has further examples and explanation, can be found here on github! Feel free to take and use the code as you please.

My first article, ‘Text Preprocessing Is Coming’, can be found here!
