Predicting Movie Genres using NLP – An Awesome Introduction to Multi-Label Classification

Far more interesting and meaningful words have now emerged, such as “police”, “family”, “money”, “city”, etc.

Converting Text to Features

I mentioned earlier that we will treat this multi-label classification problem as a Binary Relevance problem.

Hence, we will now one-hot encode the target variable, i.e., genre_new, using sklearn’s MultiLabelBinarizer().

Since there are 363 unique genre tags, there are going to be 363 new target variables.

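The encoding code is embedded as a Gist in the original; here is a minimal sketch of that step, assuming the dataframe is named movies (that name is an assumption; genre_new and multilabel_binarizer come from the article itself):

from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(movies['genre_new'])

# transform the genre lists into a binary indicator matrix, one column per tag
y = multilabel_binarizer.transform(movies['genre_new'])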

Now, it’s time to turn our focus to extracting features from the cleaned version of the movie plots data.

For this article, I will be using TF-IDF features.

Feel free to use any other feature extraction method you are comfortable with, such as Bag-of-Words, word2vec, GloVe, or ELMo.

I recommend checking out the below articles to learn more about the different ways of creating features from text:

An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec
A Step-by-Step NLP Guide to Learn ELMo for Extracting Features from Text

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)

I have used the 10,000 most frequent words in the data as my features.

You can try any other number as well for the max_features parameter.

Now, before creating TF-IDF features, we will split our data into train and validation sets for training and evaluating our model’s performance.

I’m going with an 80-20 split – 80% of the data samples in the train set and the rest in the validation set.
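The split itself is embedded as a Gist; here is a minimal sketch using sklearn’s train_test_split, assuming the cleaned plots live in a clean_plot column of the movies dataframe (both names are assumptions):

from sklearn.model_selection import train_test_split

# 80-20 split of the cleaned plots and the binarized genre labels
xtrain, xval, ytrain, yval = train_test_split(
    movies['clean_plot'], y, test_size=0.2, random_state=9)  # random_state is arbitrary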

Now we can create features for the train and the validation set:

# create TF-IDF features
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)

Build Your Movie Genre Prediction Model

We are all set for the model building part! This is what we’ve been waiting for.

Remember, we will have to build a model for every one-hot encoded target variable.

Since we have 363 target variables, we will have to fit 363 different models with the same set of predictors (TF-IDF features).

As you can imagine, training 363 models can take a considerable amount of time on a modest system.

Hence, I will build a Logistic Regression model as it is quick to train on limited computational power:

from sklearn.linear_model import LogisticRegression

# Binary Relevance
from sklearn.multiclass import OneVsRestClassifier

# Performance metric
from sklearn.metrics import f1_score

We will use sklearn’s OneVsRestClassifier class to solve this problem as a Binary Relevance or one-vs-all problem:

lr = LogisticRegression()
clf = OneVsRestClassifier(lr)

Finally, fit the model on the train set:

# fit model on train data
clf.fit(xtrain_tfidf, ytrain)

Predict movie genres on the validation set:

# make predictions for validation set
y_pred = clf.predict(xval_tfidf)

Let’s check out a sample from these predictions:

y_pred[3]

It is a binary one-dimensional array of length 363.

Basically, it is the one-hot encoded form of the unique genre tags.

We will have to find a way to convert it into movie genre tags.

Luckily, sklearn comes to our rescue once again.

We will use the inverse_transform() function along with the MultiLabelBinarizer object to convert the predicted arrays into movie genre tags:

multilabel_binarizer.inverse_transform(y_pred)[3]

Output: ('Action', 'Drama')

Wow! That was smooth.

However, to evaluate our model’s overall performance, we need to take into consideration all the predictions and the entire target variable of the validation set:

# evaluate performance
f1_score(yval, y_pred, average="micro")

Output: 0.31539641943734015

We get a decent F1 score of 0.315.

These predictions were made based on a threshold value of 0.5, which means that probabilities greater than or equal to 0.5 were converted to 1’s and the rest to 0’s.

Let’s try to change this threshold value and see if that improves our model’s score. First, predict the class probabilities:

# predict probabilities
y_pred_prob = clf.predict_proba(xval_tfidf)

Now set a threshold value:

t = 0.3  # threshold value
y_pred_new = (y_pred_prob >= t).astype(int)

I have tried 0.3 as the threshold value. You should try other values as well. Let’s check the F1 score again on these new predictions:

# evaluate performance
f1_score(yval, y_pred_new, average="micro")

Output: 0.4378456703198025

That is quite a big boost in our model’s performance.

A better approach to finding the right threshold value would be to use a k-fold cross-validation setup and try a range of values, as sketched below.
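Here is a rough sketch of that idea, reusing the objects defined above; the threshold grid, fold count, and random_state are illustrative choices, not from the original article:

import numpy as np
from sklearn.model_selection import KFold

thresholds = np.arange(0.1, 0.6, 0.05)  # candidate threshold values
scores = np.zeros(len(thresholds))

kf = KFold(n_splits=5, shuffle=True, random_state=9)
for train_idx, test_idx in kf.split(xtrain_tfidf):
    # refit the Binary Relevance model on each training fold
    fold_clf = OneVsRestClassifier(LogisticRegression())
    fold_clf.fit(xtrain_tfidf[train_idx], ytrain[train_idx])
    probs = fold_clf.predict_proba(xtrain_tfidf[test_idx])
    for i, t in enumerate(thresholds):
        scores[i] += f1_score(ytrain[test_idx], (probs >= t).astype(int),
                              average='micro')

# sums and averages share the same argmax, so no division is needed
best_t = thresholds[scores.argmax()]
print(best_t)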

Create Inference Function

Wait – we are not done with the problem yet. We also have to take care of the new movie plots that will come in the future, right? Our movie genre prediction system should be able to take a movie plot in raw form as input and generate its genre tag(s).

To achieve this, let’s build an inference function.

It will take a movie plot text and follow the below steps:

Clean the text
Remove stopwords from the cleaned text
Extract features from the text
Make predictions
Return the predicted movie genre tags
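The function itself is embedded as a Gist in the original; here is a minimal sketch, assuming clean_text() and remove_stopwords() helpers were defined during the earlier text-cleaning steps (those helper names are assumptions):

def infer_tags(q):
    # clean the raw plot text (helper names assumed from the earlier steps)
    q = clean_text(q)
    q = remove_stopwords(q)
    # extract TF-IDF features with the already-fitted vectorizer
    q_vec = tfidf_vectorizer.transform([q])
    # predict binary genre indicators and map them back to tag names
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)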

Let’s test this inference function on a few samples from our validation set.
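As a usage sketch, we can pick a few random validation samples and compare the predictions against the actual tags (the xval indexing and the movie_name column are assumptions):

for i in range(5):
    k = xval.sample(1).index[0]
    print("Movie: ", movies['movie_name'][k])
    print("Predicted genre: ", infer_tags(xval[k]))
    print("Actual genre: ", movies['genre_new'][k], "\n")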

Yay! We’ve built a very serviceable model.

The model is not yet able to predict rare genre tags but that’s a challenge for another time (or you could take it up and let us know the approach you followed).

Where to go from here?

If you are looking for similar challenges, you’ll find the below links useful.

I have solved a Stackoverflow Questions Tag Prediction problem using both machine learning and deep learning models in our course on Natural Language Processing.

The links to the courses are below for your reference:

Certified Course: Natural Language Processing (NLP) using Python
Certified Program: NLP for Beginners
The Ultimate AI & ML BlackBelt Program

End Notes

I would love to see different approaches and techniques from our community to achieve better results.

Try to use different feature extraction methods, build different models, fine-tune those models, etc.

There are so many things that you can try.

Don’t stop yourself here – go on and experiment! Feel free to discuss and comment in the comment section below.

The full code is available here.
