Having fun with NLP and Game of Thrones dialogues.

This article will try to answer those questions.

Before anything else, I would like to mention two people whose work inspired a lot of what you are about to see, so please check out their posts and their code as well: Paras Chopra and Daniel E.


Disclaimer on data.

The dataset consists of 65 chapters of the series. Not all seasons are complete: seasons 2, 3 and 4 in particular have missing chapters, while all other seasons are complete.

You can learn more about the technical aspects of gathering the data here.

Do more dialogues mean more words?

Barplot for dialogues

I plotted the top 20 characters by number of dialogues and by word count, and we can clearly see that Tyrion indeed never shuts up: with more than a thousand dialogues and an average of 16 words per dialogue, he is the most talkative character by far.

Barplot for words

Dialogues and words are indeed related, and the top 20 characters barely change from one plot to the other. One interesting change is that of the Lannisters, who speak more words per dialogue.
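The counting behind both bar plots can be sketched in a few lines. This is only a minimal illustration, not the code used for the article: the `dialogues` list, its (speaker, text) layout, and the `talk_stats` helper are assumptions made up for the example, and the rows are toy data.

```python
from collections import defaultdict

# Toy rows in (speaker, text) form. The real dataset's layout is not
# shown in the article, so this structure is an assumption.
dialogues = [
    ("Tyrion", "I drink and I know things"),
    ("Jon", "I know nothing"),
    ("Tyrion", "A mind needs books as a sword needs a whetstone"),
    ("Cersei", "Power is power"),
]

def talk_stats(rows):
    """Return {speaker: (n_dialogues, n_words, words_per_dialogue)}."""
    n_lines = defaultdict(int)
    n_words = defaultdict(int)
    for speaker, text in rows:
        n_lines[speaker] += 1
        # Words are approximated by whitespace-separated tokens.
        n_words[speaker] += len(text.split())
    return {s: (n_lines[s], n_words[s], n_words[s] / n_lines[s]) for s in n_lines}

stats = talk_stats(dialogues)

# Rank characters by number of dialogues, most talkative first.
ranking = sorted(stats, key=lambda s: stats[s][0], reverse=True)
```

Sorting the same dictionary by the second element of each tuple would give the ranking for the words plot, and the third element is the words-per-dialogue average.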

Let’s check that distribution of words per dialogue.

Distribution of words per dialogue of top 5 characters.

Jon Snow is a man of few words: his plot is almost flat, very similar to that of Daenerys. The Lannisters' plots are more robust, meaning more words every time they open their mouths.

Longest dialogue of the sample.

There is a big outlier in Jaime’s plot, much bigger than anything in the other four. When I checked, I realized it is the longest dialogue of them all.

The longest uninterrupted dialogue in our sample data is Jaime Lannister telling Brienne how he killed the Mad King, back in season 3, with more than 350 words.

What are our characters actually saying? Is there a pattern?

I plotted the 60 most used words by the top 5 characters. I used NLTK to get rid of very common English words, plus some other words that would not bring any value to the plot, like ‘King’, ‘Lord’, ‘Sir’, etc.
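A rough sketch of that filtering step follows. To stay self-contained it uses a tiny hand-written stop-word set as a stand-in for NLTK's English list, so the word lists, the `top_words` helper, and the sample lines are all illustrative assumptions rather than the article's actual code.

```python
import re
from collections import Counter

# Stand-in stop-word list (assumption); the article used NLTK's English list.
STOPWORDS = {"the", "a", "i", "you", "and", "to", "of", "is", "my", "for"}
# Domain words dropped because they add no value to the plot.
DOMAIN_WORDS = {"king", "lord", "sir"}
IGNORED = STOPWORDS | DOMAIN_WORDS

def top_words(lines, n=60):
    """Return the n most frequent words after removing ignored words."""
    counts = Counter()
    for text in lines:
        # Lowercase and keep alphabetic tokens (apostrophes allowed).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in IGNORED)
    return counts.most_common(n)

# Toy lines, not taken from the dataset.
sample = ["I love my brother", "The king is dead, my lord", "You love the dead"]
result = top_words(sample, n=2)
```

Running this once per character, on only that character's lines, gives the per-character word lists that the plot compares.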

Only the Lannisters have LOVE among their words, and I think that really defines them as characters: they are driven mostly by passion.

The Starks, like Jon and Ned, fight out of duty and to save the world. Dany fights because it is her destiny, and the words she uses make it pretty clear that she is a ruthless ruler. The Lannisters fight for the people they love.

I could also say that Game of Thrones is about a bunch of murderous people with extreme daddy issues.

This one was a shocker.

Who is the main character?

Speaking the most does not always mean being the most influential character. That depends on how important the people you speak to are, and on how many times you are mentioned outside your own dialogues.

In other words, it depends on how connected you are to the world around you, so here is the interaction network of our sample data.



org/chrismartinezb/e35f6c6b7a4def1dc56eea92d8897d40/ee9d335a443b042fc20c2f2eb0d55e9997d2f2b9

Check the link of the graph for a better view and for a chance to play with the nodes.
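The article does not spell out how the edges of this network were derived, so the following is only one plausible sketch. It assumes two rules: characters who speak consecutive lines interact, and a speaker interacts with any known character named in their line. The `build_edges` helper, the `rows` layout, and the toy data are all hypothetical.

```python
import re
from collections import Counter

def build_edges(rows, names):
    """Count undirected interactions between characters.

    Assumed rules (not confirmed by the article):
      1. two different characters speaking consecutive lines interact;
      2. a speaker interacts with every known character they mention.
    """
    edges = Counter()
    prev = None
    for speaker, text in rows:
        if prev is not None and prev != speaker:
            edges[frozenset((prev, speaker))] += 1
        words = set(re.findall(r"[a-z]+", text.lower()))
        for name in names:
            if name != speaker and name.lower() in words:
                edges[frozenset((speaker, name))] += 1
        prev = speaker
    return edges

# Toy rows in script order, not taken from the dataset.
rows = [
    ("Cersei", "Where is Ned?"),
    ("Tyrion", "Ask Jon."),
    ("Cersei", "Jon knows nothing."),
]
names = {"Cersei", "Tyrion", "Jon", "Ned"}
edges = build_edges(rows, names)
```

Each key is an unordered pair of characters and each value is an interaction count, which can serve as the edge weight when drawing the graph.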

Anyway, this graph tells us nothing numerical about the importance of a character.

It does tell us that Cersei, Tyrion, and Jon are important, but any of them could be the main character, right?

The degree of centrality.

If we measure the degree centrality of each node, we get a very clear answer on how important or central our characters really are to the series, so let’s have a look at the top 10 most important characters according to our sample data:

Well, Jon is indeed the song of ice and fire: by far the most important character on the show according to our sample.
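For reference, the degree centrality of a node is the number of distinct neighbours it has divided by the number of other nodes in the graph (n - 1). The sketch below computes it in plain Python on a made-up edge list. The edges are toy data chosen for illustration, not the article's results, and the `degree_centrality` helper is written for this example.

```python
def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        # With fewer than two nodes there is nothing to normalise against.
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Toy edge list (assumption), not the network from the article.
toy_edges = [
    ("Jon", "Ned"), ("Jon", "Sansa"), ("Jon", "Tyrion"), ("Jon", "Dany"),
    ("Cersei", "Tyrion"), ("Cersei", "Ned"),
]
centrality = degree_centrality(toy_edges)

# Characters ordered from most to least central.
top = sorted(centrality, key=centrality.get, reverse=True)
```

In this toy graph Jon touches four of the five other nodes, so he ranks first. Graph libraries such as NetworkX provide an equivalent `degree_centrality` function with the same normalization.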

Generating text using the sample dialogue.

Before pre-trained models were available, you needed a huge corpus of text to do anything meaningful.

Now, even a small dataset is enough to do interesting things.

Let me know in comments what project ideas come to your mind that could use a small text corpus along with a pre-trained model.

I took Paras Chopra's challenge and generated some dialogues myself, and the results are quite funny.


The main character is Jon Snow, due to his connections with other important characters like Ned, Sansa, Tyrion, and Dany.

Tyrion is below Cersei in the character centrality plot. I think this is for two main reasons. First, seasons 2 and 3 have a lot of missing chapters in the dataset, and those seasons are where Tyrion shines the most. Second, Cersei had a strong relationship with Ned and Robert (very central characters), but Tyrion never talked to either of them.

More dialogue is somewhat related to character influence, but only up to a certain point.

Other characters must talk about you even when you are not talking to them.

Ned Stark, even after his death in season 1, is one of the most important characters.

Tyrion is more of a Lannister than an ally to Daenerys, which might be a reason to expect him to betray her in the last season, as a lot of people are theorizing.

He was clustered with the Lannisters, and in the cloud plot he has more similarities to them than to Daenerys or Jon.

Even with only about 60% of the data, we could draw pretty good conclusions.

Text generation with such a small dataset worked better than I expected and was really funny to read, although it can be improved a lot.

Generating dialogues between characters will be an interesting challenge. Is there an approach that you can think of?

Who do you think is really the main character, and why?

What other conclusions could you come up with using this data?

Check this repository with every step of the analysis here.

