Adventure Scrape: Text mining on Adventure Time transcripts

|| flips out, uses magic to bring out snow monsters

The code for concatenating all of these mini DataFrames into one large one is here.

Finally, we can explore some data!

Data exploration, text mining style

In order to analyze word-by-word, I first tokenized the data, following this guide.

Here “tokenizing” means separating the text data word by word, creating a DataFrame which has, for each transcript line, a column corresponding to each word of dialogue and/or action.

I tokenized dialogue and action separately.

Before I forget, here’s the exploratory analysis code.

With tokenized data we can easily count which words appear most frequently in a corpus.

I looked at this both for a given character as a whole and per episode.
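As a rough illustration of the tokenize-then-count step, here is a minimal sketch in plain Python. It is not the pandas code used for this post: the sample rows, the `tokenize` regex, and the `word_counts` helper are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical sample rows: (episode, character, dialogue). Not real transcript data.
lines = [
    (1, "Princess Bubblegum", "The candy people need the royal decree!"),
    (1, "Finn", "Mathematical! The candy is safe."),
    (2, "Princess Bubblegum", "I know the science, Finn."),
]

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(rows, character, episode=None):
    """Count tokens for one character, optionally restricted to one episode."""
    counts = Counter()
    for ep, who, dialogue in rows:
        if who == character and (episode is None or ep == episode):
            counts.update(tokenize(dialogue))
    return counts

# Counts for the character as a whole, then for a single episode.
overall = word_counts(lines, "Princess Bubblegum")
episode_two = word_counts(lines, "Princess Bubblegum", episode=2)
print(overall.most_common(3))
print(episode_two.most_common(3))
```

Even in this toy data, “the” comes out on top, which is the same pattern the real counts show.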

The results probably won’t surprise you…

Figure: Most prevalent words in Princess Bubblegum speech for several episodes.

That is, the most common words are articles (e.g., “the”, “an”) and pronouns.

Although these words take up a lot of space in speech, they carry meaning only in the context of rarer, more distinctive words.

How do we find which words those are, and how do we measure their impact?

One method often used in text mining is computing the TF-IDF statistic.

This statistic allows us to weight the uniqueness of a word in a document relative to other documents in the same corpus, so as to determine which words define each document the most.

Following Nowacki’s blog post, I computed this statistic and others necessary for TF-IDF calculation.
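To make the statistic concrete, here is a minimal sketch of one common TF-IDF variant in plain Python: term frequency is a word's count divided by the document length, and inverse document frequency is the natural log of (number of documents / number of documents containing the word). The exact formula used in the post may differ, and the `tf_idf` function and toy `episodes` corpus are assumptions for illustration only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a document id (e.g. an episode number) to a list of tokens.

    Returns {doc_id: {word: score}} with
    tf = count / document length and idf = ln(n_docs / docs containing word).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

# Hypothetical toy corpus keyed by episode number (not real transcript data).
episodes = {
    153: ["the", "lemonhope", "is", "free"],
    155: ["gah", "okay"],
    158: ["okay", "the", "james", "is", "here"],
}
scores = tf_idf(episodes)
print(scores[155])
```

In a two-word document every word has a term frequency of 0.5, so the word that appears in fewer other documents ends up with the higher score.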

Now let’s see what appears to define Princess Bubblegum’s speech according to TF-IDF!

Figure: TF-IDF for some Princess Bubblegum speech.

These data look quite different.

The first noticeable feature is the two spikes at Episode 155 — since Princess Bubblegum only spoke two words in that episode, those words had disproportionate significance.

“gah” is weighted more strongly because “okay” is a more common word in other episodes in which she speaks.

We also see that the x-axes for these two plots feature very different words; TF-IDF showcases few if any filler words.

From the looks of these data, these episodes featured Lemonhope (Episodes 153 and 154) or James and the Jameses (Episode 158).

Checking with the Fandom Wiki, this seems to be correct!

So, for this dataset, TF-IDF is probably a good way to learn about the key features of an episode, but not necessarily about a character’s emotions or contributions to it.

Since TF-IDF seems quite powerful, and some distinct words should reappear in multiple episodes, I wondered how certain words’ importance could change over time (i.e., with increasing episode number), and if this could be a proxy for character development.

The code below shows how I constructed TF-IDF data as a function of time.

Note that I started from a MultiIndex DataFrame containing TF-IDF data, where I indexed by Word and Episode.

    import matplotlib.pyplot as plt

    # tf_idf is a MultiIndex DataFrame; level 1 of the index holds the dialogue words
    words = tf_idf.index.get_level_values(level=1).unique().tolist()  # all unique words in the character's speech

    # flatten the hierarchical index so 'dialogue word' and 'Episode' become ordinary columns
    get_more = tf_idf.reset_index()

    for word in words:
        collection = get_more.loc[get_more['dialogue word'] == word]
        if len(collection) > 1:  # only plot words that appear in more than one episode
            collection.plot(kind='line', y='tf_idf', x='Episode',
                            title='Word: ' + word, marker='.', xlim=(0, 279))
            plt.show()
            plt.close()

Some notable examples for Princess Bubblegum are below.

I found it interesting that she only used the word “royal” right at the very beginning of the show, when she was quite the stuffy, controlling monarch.

The evolution of the importance of “know” was also interesting: as she becomes more self-aware and less of a know-it-all, her use of the word waxes and wanes.

To contrast these, let’s look at a very common word, “the”, whose TF-IDF varies wildly, though the magnitudes always remain small.

For one last look at Princess Bubblegum data, I had to see how her usage of “marceline” changed in importance over time.

We can see that the TF-IDF of “marceline” peaks around Episode 200.

I know that the Fandom Wiki has very incomplete data on the Stakes mini-series, which centers around Marceline’s back story, so I knew that wasn’t the cause for the spike.

Then I remembered that “Varmints” happened shortly before Stakes, and while this transcript was also incomplete, everything made more sense.

(For the uninitiated: this episode is the turning point in the relationship between Princess Bubblegum and Marceline.)

Given this context, I had to see how “marcy” data looked.

Indeed, “marcy” only emerges as a term after ~100 episodes and becomes relatively important around “Varmints”, indicating a change in their interaction.

With so few data points I recognize I’m really reading between the lines, but it is nice to see signatures of character growth in this simple statistic!

From here the natural next step would be sentiment analysis, either per character or as a means of distinguishing characters in the show.

However, I only had ~1000 lines for Princess Bubblegum and maybe ~1200 for Ice King, two major characters in the show, and I considered this a risky endeavor.

There was significantly more data on Finn and Jake, though, with roughly 7000 and 5000 lines respectively…

InFINNerator: something entirely different

While I am quite new to machine learning in general, I have caught on to the fact that deep learning continues to trend.

My perusal of machine learning news suggests that deep learning is increasingly applied to domains usually reserved for NLP.

Thus, instead of sentiment analysis, I dove right into deep learning.

I decided to create a speech generator for Finn… the InFINNerator.

(“InFINNerator” is a triple pun, so let that sink in.)

Behold.

I hadn’t labeled speech with any labels other than, well, “dialogue”; typically sentiment analysis looks for good vs. bad, and it requires labeled data.

After all the initial cleaning for this dataset, I needed a break from dataset construction and alteration.

Maybe I will label dialogue by approximate sentiment one day, but at the moment the time cost of hand-labeling that data was too high.

The InFINNerator is a character-level recurrent neural network (RNN) written in PyTorch, run on Google Colab.

I followed this great PyTorch tutorial to understand the basics of RNNs in PyTorch.

I quickly found the simple architecture used to generate names did not have what it takes to generate speech for an enthusiastic human in a post-apocalyptic world.

(I won’t show the RNN code here, as it is very similar to a lot of other RNN sample code out there. Please see the Colab notebook if interested!)

Before diving into RNN alterations, I should first say that I had to revise my data structure further for this task.

I noticed some Unicode characters in lines of dialogue; I changed everything to ASCII.

Also, lines straight from the transcripts were of highly variable length, and this would create unnecessary complexity in the data.

I shuffled the lines, and then I joined all the lines together after separating them with an end-of-line character.

To create sequences, I chopped the corpus into 64-character new lines.
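The preparation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the post does not say which end-of-line character was used or how the leftover tail of the corpus was handled, so the `EOL` marker, the `to_ascii` approach, and dropping the final partial chunk are all guesses.

```python
import random
import unicodedata

EOL = "\n"    # assumed end-of-line marker; the post does not name the character
SEQ_LEN = 64  # sequence length stated in the post

def to_ascii(text):
    """Decompose accented characters, then drop anything that is still non-ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

def build_sequences(lines, seq_len=SEQ_LEN, seed=0):
    """Shuffle the lines, join them with EOL, and cut the corpus into fixed-length chunks."""
    cleaned = [to_ascii(line) for line in lines]
    random.Random(seed).shuffle(cleaned)
    corpus = EOL.join(cleaned) + EOL
    # Keep only full-length chunks; the trailing remainder is dropped (an assumption).
    return [corpus[i:i + seq_len]
            for i in range(0, len(corpus) - seq_len + 1, seq_len)]

# Hypothetical dialogue lines, repeated to get enough characters for a few chunks.
sample_lines = ["Mathematical!", "What time is it?", "Adventure Time!"] * 10
sequences = build_sequences(sample_lines)
print(len(sequences), repr(sequences[0]))
```

Note that characters with no ASCII decomposition (curly quotes, for instance) are simply removed by this approach, so a real pipeline might want to map them to ASCII equivalents first.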

The first modification I made to the RNN was to train in mini-batches.

Initially I trained one one-hot char at a time, with each gradient step completed after one 64-char line had been processed.

With mini-batches, a gradient step finished after a 64-char line had been processed across a batch of lines (I used a batch size of 32 lines).
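Here is a minimal sketch of how fixed-length lines could be grouped into mini-batches, written in plain Python rather than PyTorch tensors. It assumes the usual character-RNN setup in which the target for each position is the next character, which the post does not spell out; `make_batches` and `one_hot` are hypothetical helper names.

```python
def make_batches(sequences, batch_size=32):
    """Group sequences into full batches and pair each with next-character targets."""
    batches = []
    for start in range(0, len(sequences) - batch_size + 1, batch_size):
        chunk = sequences[start:start + batch_size]
        inputs = [seq[:-1] for seq in chunk]   # every character except the last
        targets = [seq[1:] for seq in chunk]   # the same text shifted one step ahead
        batches.append((inputs, targets))
    return batches

def one_hot(char, alphabet):
    """Encode a single character as a one-hot list over the given alphabet."""
    vec = [0] * len(alphabet)
    vec[alphabet.index(char)] = 1
    return vec

# Hypothetical data: 70 identical 64-character lines give two full batches of 32.
demo_sequences = ["abcd" * 16] * 70
demo_batches = make_batches(demo_sequences, batch_size=32)
print(len(demo_batches), len(demo_batches[0][0]))
```

With this layout, one gradient step corresponds to one `(inputs, targets)` pair, i.e. 32 lines processed together instead of a single line.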

This led to minor improvements.

This single-layer vanilla RNN generated little more than gibberish with lots of exclamation points.

However, to be fair to this first attempt, Finn usually is very excited about everything, so it did get to the essence of his outward expression.

Here’s a taste of the output:

00ohi ts therikn g homee, is!.int n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

So, I had to kick it up a few more notches.

I switched out the vanilla RNN for an LSTM network.

This LSTM had 1–2 layers and a linear decoder, and at the end I returned the output as a LogSoftmax distribution.
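For readers unfamiliar with the last step, here is a small plain-Python sketch of what a log-softmax output means and how a next character could be sampled from it. It mirrors the math that PyTorch's `LogSoftmax` computes but is not the model code; the logits, the three-letter alphabet, and the `sample_char` helper are made up for illustration.

```python
import math
import random

def log_softmax(logits):
    """Turn raw decoder scores into log-probabilities (numerically stable form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def sample_char(log_probs, alphabet, rng):
    """Draw one character according to the probabilities exp(log_probs)."""
    probs = [math.exp(lp) for lp in log_probs]
    return rng.choices(alphabet, weights=probs, k=1)[0]

# Hypothetical decoder scores over a three-character alphabet.
alphabet = "ab!"
log_probs = log_softmax([2.0, 1.0, 0.1])
next_char = sample_char(log_probs, alphabet, random.Random(0))
print(log_probs, next_char)
```

Generating text then amounts to repeating this draw, feeding each sampled character back in as the next input.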

The LSTM was loosely inspired by this character RNN.

I tried several architectures:

- 1 layer with hidden_layer_size=256, learning rate=4e-4, 1e-3
- 2 layers with hidden_layer_size=256, learning rate=1e-4, 4e-4
- 1 layer with hidden_layer_size=512, learning rate=1e-4, 4e-4, 1e-3
- 1 layer with hidden_layer_size=1024, learning rate=1e-4, 4e-4

Unfortunately, the performance of these beefed-up LSTMs wasn’t much better than the mini-batched vanilla RNN…

If I do wind up improving InFINNerator, I will update this post.

For now, enjoy the sweet Finn screams.

You can hear this image, can’t you?

In the meantime…

I have a few ideas why this didn’t perform as well as I hoped:

- Google Colab occasionally mis-remembers the state of the code it’s running, and that can lead to inconsistencies in training.
- The dataset is small, and Adventure Time is a show with a lot of variation in its subject matter, so the data simply could be too volatile. Honestly I expected to over-fit the data since I only had ~300,000 characters total…but it seems much more training time or a bigger network could help.
- There are some secrets to PyTorch I have yet to discover.

If you have any feedback on InFINNerator or anything else here, please let me know.

I have learned a lot on this project, and I have much yet to learn about machine learning, text mining, and deep learning.

Thanks for reading!
