An NLP View on Holiday Movies — Part I: Topic Modeling using Gensim and SKlearn

But still, Christmas is too omnipresent.

Plan B: Clustering

As a fallback plan, I wanted to try a simple K-Means clustering exercise on the TF-IDF-vectorized documents. This is simple enough to do. We can then use T-SNE to reduce the dimensionality of the TF-IDF vectors down to two, and use the new clusters to color the dots.

The results look good enough. (Disclosure note: I asked my wife to help name the clusters.)

Cluster 0 looks like mainly wedding movies
Cluster 1 looks like love movies in general
Cluster 2 movies seem like love movies, but with a fantasy twist (royals, witches)
Cluster 3 movies seem to be seasonal love movies (harvest, summer)
Cluster 4 movies are clearly the Christmas movies: jackpot!
Cluster 5 movies: not sure yet, my wife insists we watch some to get a better feel. Not falling for that one…
Cluster 6 movies look like movies with a ‘food’ aspect to them

Something cool that struck me: the movie ‘Marry Me at Christmas’ is a hybrid between a Christmas movie and a wedding movie. K-Means tagged it as a wedding movie, but as we can see, it could just as easily have been a Christmas movie.

On a general note, TF-IDF really helps a lot towards obtaining a good clustering result.

Alrighty, so this will definitely do. We will take all ‘Cluster 4’ movies for the next step toward our goal of generating our own Christmas movie, in Part II of this blog post. The final dataset (in a pandas dataframe) can be found in the repo as well.

See you there!
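
As a rough illustration of the Plan B steps above (TF-IDF vectorization, K-Means, then T-SNE down to two dimensions), here is a minimal sketch using scikit-learn. The sample plots, the number of clusters, the perplexity and the random seeds are all assumptions made up for illustration, not the settings behind the results above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical stand-in plot summaries; the real input would be the
# movie descriptions from the dataset in the repo.
plots = [
    "A wedding planner falls in love while organizing a bride's big day.",
    "The bride and groom rethink their wedding after an old flame returns.",
    "A baker saves her family bakery and wins the town cookie contest.",
    "A chef opens a restaurant and finds love over a shared recipe.",
    "A big city lawyer returns home for Christmas and meets Santa's helper.",
    "A Christmas tree farm owner fights to save the farm before Christmas Eve.",
    "A prince in disguise falls for a commoner at the royal ball.",
    "A young witch hides her magic from the prince she loves.",
]

# Step 1: turn each plot into a TF-IDF vector (English stop words removed).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(plots)

# Step 2: K-Means on the TF-IDF vectors. The cluster count is an assumption
# for this tiny sample; the post itself ends up with clusters 0 through 6.
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
labels = kmeans.fit_predict(tfidf)

# Step 3: T-SNE down to two dimensions for plotting. Perplexity must be
# smaller than the number of documents, hence the small value here.
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=42)
coords = tsne.fit_transform(tfidf.toarray())

# Each document now has a cluster label (for the color) and an (x, y) position.
for label, (x, y), plot in zip(labels, coords, plots):
    print(label, round(float(x), 2), round(float(y), 2), plot[:40])
```

On the real dataset you would raise n_clusters to 7 to match clusters 0 through 6, then hand coords and labels to a scatter plot so that each dot is colored by its cluster.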
