Step-by-Step Deep Learning Tutorial to Build your own Video Classification Model

Let’s read it as well.

This is what the first five rows look like.

We have the corresponding class or tag for each frame.

Now, using this .csv file, we will read the frames that we extracted earlier and then store those frames as a NumPy array.

Output: (73844, 224, 224, 3). We have 73,844 images, each of size (224, 224, 3).
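The reading-and-stacking step can be sketched as follows. This is a minimal, self-contained version using Pillow rather than the article's exact helper; the function name `load_frames` and the column name `image` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from PIL import Image

def load_frames(df, image_col="image", size=(224, 224)):
    """Read each frame file listed in the dataframe, resize it to
    224x224, and stack everything into one (N, 224, 224, 3) array."""
    frames = []
    for path in df[image_col]:
        img = Image.open(path).convert("RGB").resize(size)
        frames.append(np.asarray(img))
    return np.stack(frames)
```

Calling `load_frames` on the dataframe of frame paths yields the `(N, 224, 224, 3)` array shown in the output above.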

Next, we will create the validation set.

Creating a validation set

To create the validation set, we need to make sure that the distribution of each class is similar in both the training and validation sets.

We can use the stratify parameter to do that.

Here, stratify = y (where y holds the class or tag of each frame) preserves a similar distribution of classes in both the training and validation sets.
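A minimal sketch of the stratified split, using scikit-learn's `train_test_split` with toy data (the 80/20 class ratio and `test_size=0.2` here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-ins: 100 frames, 80 of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both splits
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```

After this split, both `y_train` and `y_valid` contain 20% class-1 labels, mirroring the full dataset.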

Remember – there are 101 categories in which a video can be classified.

So, we will have to create 101 different columns in the target, one for each category.

We will use the get_dummies() function for that.
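Here is what `get_dummies()` does on a toy set of tags; in the article, `y` holds one of the 101 UCF101 categories per frame, so the result has 101 columns.

```python
import pandas as pd

# toy tags standing in for the full 101-category target
y = pd.Series(["ApplyEyeMakeup", "Archery", "Archery", "Basketball"])

# one 0/1 column per category; exactly one 1 per row
y_dummies = pd.get_dummies(y)
```

Each row of `y_dummies` is a one-hot vector, which is what the softmax output layer and categorical cross-entropy loss expect.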

Next step – define the architecture of our video classification model.

Defining the architecture of the video classification model

Since we do not have a very large dataset, creating a model from scratch might not work well.

So, we will use a pre-trained model and take its learnings to solve our problem.

For this particular dataset, we will be using the VGG-16 pre-trained model.

Let’s create a base model from the pre-trained model.

This model was trained on a dataset that has 1,000 classes.

We will fine-tune this model as per our requirement. include_top = False removes the fully connected layers at the top of this model so that we can add our own layers as per our need.

Now, we will extract features from this pre-trained model for our training and validation images.

Output: (59075, 7, 7, 512). We have 59,075 images in the training set, and the shape has changed to (7, 7, 512) since we have passed these images through the VGG16 architecture.
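The base-model creation and feature extraction can be sketched like this (shown on two random stand-in frames; `weights='imagenet'` downloads the pre-trained weights on first use):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# include_top=False drops VGG16's fully connected head, so a
# 224x224x3 input comes out as a (7, 7, 512) feature map
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))

# two random "frames" stand in for the real training images
X = np.random.randint(0, 256, (2, 224, 224, 3)).astype('float32')
features = base_model.predict(preprocess_input(X))
```

Running the full 59,075-frame training array through `base_model.predict` produces the `(59075, 7, 7, 512)` output above.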

Similarly, we will extract features for the validation frames.

Output: (14769, 7, 7, 512). There are 14,769 images in the validation set, and their shape has also changed to (7, 7, 512).

We will use a fully connected network now to fine-tune the model.

This fully connected network takes input in a single dimension.

So, we will reshape the images into a single dimension.

It is always advisable to normalize the pixel values, i.e., keep the pixel values between 0 and 1. This helps the model converge faster.
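Both steps can be sketched with NumPy on a small stand-in array (10 frames instead of 59,075; normalizing by the array maximum is one simple way to land in [0, 1]):

```python
import numpy as np

# toy stand-in for the extracted VGG16 features of 10 frames
X_train = np.random.rand(10, 7, 7, 512)

# flatten each (7, 7, 512) feature map into one 25,088-long vector
X_train = X_train.reshape(X_train.shape[0], 7 * 7 * 512)

# scale the values into [0, 1] by dividing by the maximum
X_train = X_train / X_train.max()
```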


Next, we will create the architecture of the model.

We have to define the input shape for that.

So, let’s check the shape of our images.

Output: (59075, 25088). The input shape will be 25,088.

Let’s now create the architecture.

We have multiple fully connected dense layers.

I have added dropout layers as well so that the model does not overfit.

The number of neurons in the final layer is equal to the number of classes we have, hence 101 neurons here.
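A sketch of such an architecture is below. The layer widths (1024, 512) and dropout rate (0.5) are illustrative choices, not the article's exact values; only the 25,088-dimensional input and the 101-way softmax output are fixed by the text.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# fully connected network on top of the flattened VGG16 features
model = Sequential([
    Dense(1024, activation='relu', input_shape=(25088,)),
    Dropout(0.5),
    Dense(512, activation='relu'),
    Dropout(0.5),
    Dense(101, activation='softmax'),  # one neuron per UCF101 class
])
```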

Training the video classification model

We will now train our model using the training frames and validate the model using the validation frames.

We will save the weights of the model so that we do not have to retrain it again and again.

So, let’s define a function to save the weights of the model.

We will decide the optimum model based on the validation loss.

Note that the weights will be saved as weights.hdf5. You can rename the file if you wish.

Before training the model, we have to compile it.

We are using categorical_crossentropy as the loss function and Adam as the optimizer.
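The checkpointing and compilation steps can be sketched as follows (a tiny stand-in model is defined so the snippet is self-contained; note that newer Keras versions may require a `.keras` extension for the checkpoint file instead of `.hdf5`):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint

# a stand-in model; the real one is the dense network defined above
model = Sequential([Dense(101, activation='softmax', input_shape=(25088,))])

# keep only the weights with the lowest validation loss
mcp_save = ModelCheckpoint('weights.hdf5', monitor='val_loss',
                           mode='min', save_best_only=True)

model.compile(loss='categorical_crossentropy', optimizer='Adam',
              metrics=['accuracy'])

# training would then pass the callback to fit(), e.g.:
# model.fit(X_train, y_train, epochs=200,
#           validation_data=(X_valid, y_valid), callbacks=[mcp_save])
```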

Let’s train the model.

I have trained the model for 200 epochs.

To download the weights which I got after training the model, you can use this link.

We now have the weights which we will use to make predictions for the new videos.

So, in the next section, we will see how well this model performs on the task of video classification!

Evaluating our Video Classification Model

Let’s open a new Jupyter notebook to evaluate the model.

The evaluation part can be split into multiple steps to understand the process more clearly:

1. Define the model architecture and load the weights
2. Create the test data
3. Make predictions for the test videos
4. Finally, evaluate the model

Defining model architecture and loading weights

You’ll be familiar with the first step – importing the required libraries.

Next, we will define the model architecture, which will be similar to what we had while training the model.

This is the pre-trained model, and we will fine-tune it next.

Now that we have defined the architecture, we will load the trained weights which we stored as weights.hdf5.

Compile the model as well.

Make sure that the loss function, optimizer, and metrics are the same as those used while training the model.

Creating the test data

You should have downloaded the train/test split files as per the official documentation of the UCF101 dataset.

If not, download it from here.

In the downloaded folder, there is a file named “testlist01.txt” which contains the list of test videos.

We will make use of that to create the test data.

We now have the list of all the videos stored in a dataframe.
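Building that dataframe can be sketched with a small helper; the function name `read_test_list`, the column name `video_name`, and the sample path format are assumptions for illustration.

```python
import pandas as pd

def read_test_list(path):
    """Read the official split file (one video path per line, e.g.
    ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi) into a dataframe."""
    with open(path) as f:
        videos = [line.strip() for line in f if line.strip()]
    return pd.DataFrame({'video_name': videos})
```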

To map the predicted categories to the actual categories, we will use the train_new.csv file.

Now, we will make predictions for the videos in the test set.

Generating predictions for test videos

Let me summarize what we will be doing in this step before looking at the code. The below steps will help you understand the prediction part:

1. First, we will create two empty lists – one to store the predictions and the other to store the actual tags
2. Then, we will take each video from the test set, extract its frames, and store them in a folder (create a folder named temp in the current directory to store the frames). We will remove all other files from this folder at each iteration
3. Next, we will read all the frames from the temp folder, extract features for these frames using the pre-trained model, predict a tag for each frame, and then take the mode of those predictions to assign a tag to that particular video and append it to the first list
4. We will append the actual tag for each video to the second list

Let’s code these steps and generate predictions.
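The frame-to-video mode step can be sketched with NumPy on toy per-frame softmax outputs (4 classes and 5 frames here are illustrative; the real model outputs 101 probabilities per frame):

```python
import numpy as np

# toy stand-in: softmax outputs for 5 frames of one video over 4 classes
frame_probs = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.10, 0.80, 0.05, 0.05],
])

frame_tags = frame_probs.argmax(axis=1)       # per-frame class ids
video_tag = np.bincount(frame_tags).argmax()  # mode = video-level tag
```

Four of the five frames vote for class 1, so the video as a whole is tagged with class 1 despite one disagreeing frame.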

This step will take some time as there are around 3,800 videos in the test set.

Once we have the predictions, we will calculate the performance of the model.

Evaluating the model

Time to evaluate our model and see what all the fuss was about.

We have the actual tags as well as the tags predicted by our model.

We will make use of these to get the accuracy score.
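Computing the score from the two lists can be sketched with scikit-learn's `accuracy_score` on toy tags (the category names here are illustrative):

```python
from sklearn.metrics import accuracy_score

actual  = ['Archery', 'Basketball', 'Archery', 'Diving']
predict = ['Archery', 'Basketball', 'Diving',  'Diving']

score = accuracy_score(actual, predict) * 100  # 3 of 4 correct -> 75.0
```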

On the official documentation page of UCF101, the currently reported accuracy is 43.9%. Can our model beat that? Let’s check!

Output: 44.80570975416337

Great! Our model’s accuracy of 44.8% is comparable to what the official documentation states (43.9%).
You might be wondering why we are satisfied with an accuracy below 50%.

Well, the reason behind this low accuracy is mainly a lack of data.

We only have around 13,000 videos and even those are of a very short duration.

End Notes

In this article, we covered one of the most interesting applications of computer vision – video classification.

We first understood how to deal with videos, then extracted frames, trained a video classification model, and finally achieved a comparable accuracy of 44.8% on the test videos.

We can now try different approaches and aim to improve the performance of the model.

One approach I can think of is to use 3D convolutions, which can deal with videos directly.

Since videos are a sequence of frames, we can solve it as a sequence problem as well.

So, there can be many more solutions to this problem, and I suggest you explore them.

Feel free to share your findings with the community.

As always, if you have any suggestions or doubts related to this article, post them in the comments section below and I will be happy to answer them.

And as I mentioned earlier, do check out the computer vision course if you’re new to this field.
