Analyzing Text Classification Techniques on YouTube Data

“Let’s pop a bottle of champagne to celebrate!” No, not yet.

Even though computers today can solve the world’s problems and play hyper-realistic video games, they are still machines that do not understand our language.

Thus, we cannot feed our text data as it is to our machine learning models, no matter how clean it is.

Therefore, we need to convert it into numerical features so that the computer can construct a mathematical model as a solution.

This constitutes the data pre-processing step.

[Figure: Category column after LabelEncoding]

Since the output variable (‘Category’) is also categorical in nature, we need to encode each class as a number.

This is called Label Encoding.
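As a rough sketch, label encoding with Scikit-learn could look like the following (the DataFrame name ‘df’ is an assumption; ‘Category’ is the output column discussed above):

from sklearn.preprocessing import LabelEncoder

# Encode each class of the 'Category' column as an integer
label_encoder = LabelEncoder()
df['Category'] = label_encoder.fit_transform(df['Category'])

# Mapping from integer id back to the original class name
print(dict(enumerate(label_encoder.classes_)))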

Finally, let’s pay attention to the main piece of information for each sample — the raw text data.

In order to extract features from the raw text and represent them in a numerical format, a very common approach is vectorization.

The Scikit-learn library provides the ‘TfidfVectorizer’ class for this very purpose.

TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how often it appears within a document and how rare it is across all documents, which captures the importance of each word.
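A minimal sketch of such vectorization with Scikit-learn is shown below; the column name ‘Title’ and the parameter values are illustrative assumptions, not the exact settings used in the project:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a TF-IDF vectorizer on one text column (here the video titles),
# extracting both unigrams and bigrams
tfidf_title = TfidfVectorizer(sublinear_tf=True, min_df=5,
                              ngram_range=(1, 2), stop_words='english')
title_features = tfidf_title.fit_transform(df['Title'])
print(title_features.shape)  # (number of samples, number of TF-IDF features)

A second vectorizer would be fitted on the Description column in the same way.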

Data Analysis and Feature Exploration

As an additional step, I have decided to show the distribution of classes to check for an imbalanced number of samples.

Also, I wanted to check whether the features extracted using TF-IDF vectorization made any sense, so I decided to find the most correlated unigrams and bigrams for each class using both the Title and Description features.
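One common way to compute such a list is a chi-squared test between the TF-IDF features and a binary “belongs to this class” indicator; the sketch below reuses the variables from the earlier snippets and is illustrative rather than the project’s exact code:

import numpy as np
from sklearn.feature_selection import chi2

N = 5  # number of top n-grams to show per class
for category_id, category in enumerate(label_encoder.classes_):
    # Chi-squared score of every TF-IDF feature against this class
    scores, _ = chi2(title_features, df['Category'] == category_id)
    indices = np.argsort(scores)
    feature_names = np.array(tfidf_title.get_feature_names_out())[indices]
    unigrams = [t for t in feature_names if len(t.split(' ')) == 1]
    bigrams = [t for t in feature_names if len(t.split(' ')) == 2]
    print(f"# '{category}':")
    print('Most correlated unigrams:', ', '.join(unigrams[-N:]))
    print('Most correlated bigrams:', ', '.join(bigrams[-N:]))

The output of this analysis for both the Title and Description features is listed below.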

# USING TITLE FEATURES

# 'art and music':
Most correlated unigrams: paint, official, music, art, theatre
Most correlated bigrams: capitol theatre, musical theatre, work theatre, official music, music video

# 'food':
Most correlated unigrams: foods, eat, snack, cook, food
Most correlated bigrams: healthy snack, snack amp, taste test, kid try, street food

# 'history':
Most correlated unigrams: discoveries, archaeological, archaeology, history, anthropology
Most correlated bigrams: history channel, rap battle, epic rap, battle history, archaeological discoveries

# 'manufacturing':
Most correlated unigrams: business, printer, process, print, manufacture
Most correlated bigrams: manufacture plant, lean manufacture, additive manufacture, manufacture business, manufacture process

# 'science and technology':
Most correlated unigrams: compute, computers, science, computer, technology
Most correlated bigrams: science amp, amp technology, primitive technology, computer science, science technology

# 'travel':
Most correlated unigrams: blogger, vlog, travellers, blog, travel
Most correlated bigrams: viewfinder travel, travel blogger, tip travel, travel vlog, travel blog

# USING DESCRIPTION FEATURES

# 'art and music':
Most correlated unigrams: official, paint, music, art, theatre
Most correlated bigrams: capitol theatre, click listen, production connexion, official music, music video

# 'food':
Most correlated unigrams: foods, eat, snack, cook, food
Most correlated bigrams: special offer, hiho special, come play, sponsor series, street food

# 'history':
Most correlated unigrams: discoveries, archaeological, history, archaeology, anthropology
Most correlated bigrams: episode epic, epic rap, battle history, rap battle, archaeological discoveries

# 'manufacturing':
Most correlated unigrams: factory, printer, process, print, manufacture
Most correlated bigrams: process make, lean manufacture, additive manufacture, manufacture business, manufacture process

# 'science and technology':
Most correlated unigrams: quantum, computers, science, computer, technology
Most correlated bigrams: quantum computers, primitive technology, quantum compute, computer science, science technology

# 'travel':
Most correlated unigrams: vlog, travellers, trip, blog, travel
Most correlated bigrams: tip travel, start travel, expedia viewfinder, travel blogger, travel blog

Modeling and Training

The four models we will be analyzing are: Naive Bayes Classifier, Support Vector Machine, AdaBoost Classifier, and LSTM. The dataset is split into Train and Test sets with a split ratio of 8:2.

Features for Title and Description are computed independently and then concatenated to construct a final feature matrix.

This matrix is used to train the classifiers (except the LSTM).
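A minimal sketch of this step is given below; ‘title_features’ and ‘desc_features’ are the TF-IDF matrices for the two text columns (the latter assumed to come from a second vectorizer), and Naive Bayes stands in for any of the classical models:

from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Concatenate the Title and Description TF-IDF matrices into one feature matrix
X = hstack([title_features, desc_features])
y = df['Category']

# 8:2 split between Train and Test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit one of the classical classifiers, e.g. Naive Bayes
clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))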

For the LSTM, the data pre-processing step is quite different, as discussed before.

Here is the process for that:
1. Combine the Title and Description of each sample into a single sentence.
2. Tokenize the combined sentence into padded sequences: each sentence is converted into a list of tokens, each token is assigned a numerical id, and then each sequence is made the same length by padding shorter sequences and truncating longer ones.
3. One-hot encode the ‘Category’ variable.
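A minimal sketch of these three steps with Keras (the vocabulary size and sequence length are illustrative choices, not the exact values used in the project):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# 1. Combine Title and Description into a single sentence per sample
texts = (df['Title'] + ' ' + df['Description']).tolist()

# 2. Tokenize and pad/truncate every sequence to the same length
MAX_WORDS, MAX_LEN = 20000, 100
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X_lstm = pad_sequences(sequences, maxlen=MAX_LEN)

# 3. One-hot encode the label-encoded 'Category' column
y_lstm = to_categorical(df['Category'])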

The learning curves for the LSTM are given below.

[Figures: LSTM Loss Curve, LSTM Accuracy Curve]

Analyzing Performance

Following are the Precision-Recall curves for all the different classifiers.

To get additional metrics, check out the complete code.
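For reference, a one-vs-rest Precision-Recall curve per class can be drawn with Scikit-learn roughly as follows (a sketch that reuses the Naive Bayes model and variables from the earlier snippets; any classifier exposing predict_proba works the same way):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import label_binarize

y_score = clf.predict_proba(X_test)
y_test_bin = label_binarize(y_test, classes=range(len(label_encoder.classes_)))

# One curve per class, treating each class as a binary problem
for i, name in enumerate(label_encoder.classes_):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    plt.plot(recall, precision, label=name)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()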

The ranking of the classifiers as observed in our project is as follows: LSTM > SVM > Naive Bayes > AdaBoost.

LSTMs have shown stellar performance in multiple tasks in Natural Language Processing, including this one.

The presence of multiple ‘gates’ in LSTMs allows them to learn long-term dependencies in sequences.
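For reference, a minimal Keras LSTM classifier of the kind described here might look like the sketch below (layer sizes, epochs, and other hyperparameters are assumptions, not the exact architecture used in the project):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

num_classes = len(label_encoder.classes_)
model = Sequential([
    Embedding(input_dim=MAX_WORDS, output_dim=128),  # learn word embeddings
    LSTM(64),                                        # gated recurrent layer
    Dense(num_classes, activation='softmax'),        # one probability per class
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
history = model.fit(X_lstm, y_lstm, validation_split=0.2,
                    epochs=10, batch_size=64)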

10 points to Deep Learning!

SVMs are highly robust classifiers that do their best to find interactions between our extracted features, but the interactions they learn are not on par with those of the LSTM.

The Naive Bayes classifier, on the other hand, treats the features as independent, so it performs slightly worse than the SVM because it does not take any interactions between features into account.

The AdaBoost classifier is quite sensitive to the choice of hyperparameters, and since I used the default model, its parameters are not optimal, which might be the reason for its poor performance.

I hope this has been as informative for you as it has been for me.

The complete code can be found on my GitHub.

Ciao.

