LDA on the Texts of Harry Potter

In this post, I'll describe topic modeling with Latent Dirichlet Allocation (LDA) and compare different algorithms for it, through the lens of Harry Potter. Feel free to contact me with any questions!

Let's say I've got these four (rather nonsensical) documents:

document_0 = "Harry Harry wand Harry magic wand"
document_1 = "Hermione robe Hermione robe magic Hermione"
document_2 = "Malfoy spell Malfoy magic spell Malfoy"
document_3 = "Harry Harry Hermione Hermione Malfoy Malfoy"

Here's the term-frequency matrix for these documents:

              Harry  wand  magic  Hermione  robe  Malfoy  spell
document_0      3     2      1       0       0      0       0
document_1      0     0      1       3       2      0       0
document_2      0     0      1       0       0      3       2
document_3      2     0      0       2       0      2       0

Just from glancing at this, it seems pretty obvious that document_0 is mostly about Harry, partly about wand, and a little bit about magic.
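Since the rest of the post works in Gensim, here's one way those counts might be built in code; a minimal sketch, where the whitespace tokenization and variable names are my own rather than from the post:

```python
from gensim import corpora

documents = [document_0, document_1, document_2, document_3]
texts = [doc.split() for doc in documents]  # tokenize on whitespace

dictionary = corpora.Dictionary(texts)                 # term -> integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # per-document (term_id, count) pairs

print(corpus[0])  # document_0's counts: Harry -> 3, wand -> 2, magic -> 1
```

This `dictionary`/`corpus` pair is the input format Gensim's LDA models expect, so the same objects are reused in the sketches below.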
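LDA formalizes that intuition: it models each document as a mixture of latent topics, and each topic as a distribution over words. As a toy illustration (mine, not the post's), Gensim's standard model could be fit to these four documents like so:

```python
from gensim.models import LdaModel

# Fit a 3-topic model; many passes because the corpus is tiny
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3,
               passes=50, random_state=42)

# Topic mixture for document_0 -- ideally dominated by a Harry/wand topic
print(lda.get_document_topics(corpus[0]))

# The words that define each inferred topic
for topic_id, topic in lda.print_topics(num_words=3):
    print(topic_id, topic)
```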
So how many topics should a model have? In this case, we'll plot the coherence score against the number of topics. You'll generally want to pick the lowest number of topics where the coherence score begins to level off.
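A typical way to produce that plot with Gensim looks something like this; a sketch, assuming `corpus`, `dictionary`, and a `tokenized_texts` list already exist for the real Harry Potter data:

```python
import matplotlib.pyplot as plt
from gensim.models import CoherenceModel, LdaModel

# Train one model per candidate topic count and record its c_v coherence
coherence_scores = []
for k in range(5, 45, 5):
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     passes=10, random_state=42)
    cm = CoherenceModel(model=model, texts=tokenized_texts,
                        dictionary=dictionary, coherence='c_v')
    coherence_scores.append((k, cm.get_coherence()))

# Coherence vs. number of topics -- look for where the curve levels off
ks, scores = zip(*coherence_scores)
plt.plot(ks, scores, marker='o')
plt.xlabel('Number of topics')
plt.ylabel('Coherence (c_v)')
plt.show()
```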
The difference between Mallet and Gensim's standard LDA is that Gensim's model uses variational Bayes inference, which is faster but less precise than Mallet's Gibbs sampling. Fortunately for those who prefer to code in Python, Gensim has a wrapper for Mallet: Latent Dirichlet Allocation via Mallet. Once everything is set up, implementing the model is pretty much the same as with Gensim's standard model (see the sketch below).

Using Mallet, the coherence score for the 20-topic model increased to 0.375 (remember, Gensim's standard model output 0.319). It's a modest increase, but one that usually persists across a variety of data sources, so although Mallet is slightly slower, I prefer it for the better coherence it returns.

Finally, I built a Mallet model on the 192 chapters of all 7 books in the Harry Potter series. Here are the top 10 keywords the model output for each latent topic.
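For reference, here's roughly what that Mallet side looks like; a sketch assuming a local Mallet install and a pre-4.0 Gensim (the wrapper was removed in Gensim 4.0), with the path below as a placeholder:

```python
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet  # Gensim < 4.0 only

mallet_path = '/path/to/mallet-2.0.8/bin/mallet'  # placeholder: your local Mallet binary

# Same corpus/dictionary inputs as Gensim's standard LdaModel
ldamallet = LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=dictionary)

# Coherence, computed the same way as for the standard model
cm = CoherenceModel(model=ldamallet, texts=tokenized_texts,
                    dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())

# Top 10 keywords per latent topic
for topic_id, words in ldamallet.show_topics(num_topics=20, num_words=10, formatted=False):
    print(topic_id, [word for word, weight in words])
```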