Multi-Class Classification in Text using R

In this problem, by contrast, we have to classify each data point into one of 13 classes, which makes this a multi-class classification problem.

    # For each talk, keep the name of the rating with the highest count
    # (which.max() keeps the first rating if several tie)
    for (i in seq_along(ted_ratings)) {
      ted_ratings_df <- ted_ratings[[i]]
      highest_rating_count <- ted_ratings_df[which.max(ted_ratings_df$count), ]
      ted_talks$highest_rating[i] <- highest_rating_count$name
    }
    ted_talks$highest_rating <- as.factor(ted_talks$highest_rating)

With the above step, our dataset preparation is now complete.

Data Modeling

We will now split our dataset into training and test sets. I have divided my dataset in a 60:40 ratio.

    # Sample 60% of the rows for training
    trainObs <- sample(nrow(ted_talks), floor(0.6 * nrow(ted_talks)), replace = FALSE)
    # Use the rows not sampled for training, so the two sets do not overlap
    testObs <- setdiff(seq_len(nrow(ted_talks)), trainObs)
    train_dat <- ted_talks[trainObs, ]
    test_dat <- ted_talks[testObs, ]

I will now apply all the pre-processing steps to my training and test data (separately). I was in two minds about whether to split the DTM into train and test, or to split the dataset and then prepare the two DTMs individually. I chose the latter option. You can try the former option and let me know if it works out fine for you. One caveat with the latter choice: predict() needs every term the model was trained on to be present as a column in the test matrix, which separately built DTMs do not guarantee.

I also took care of sparsity, something which I discussed in good detail in my blog. I also renamed my target variable to "y" instead of highest_rating, to keep the model formula short.

    library(tm)  # stemDocument() also needs the SnowballC package installed

    train_corpus <- VCorpus(VectorSource(train_dat$transcript))
    ## Removing punctuation
    train_corpus <- tm_map(train_corpus, content_transformer(removePunctuation))
    ## Removing numbers
    train_corpus <- tm_map(train_corpus, removeNumbers)
    ## Converting to lower case
    train_corpus <- tm_map(train_corpus, content_transformer(tolower))
    ## Removing stop words
    train_corpus <- tm_map(train_corpus, content_transformer(removeWords), stopwords("english"))
    ## Stemming
    train_corpus <- tm_map(train_corpus, stemDocument)
    ## Whitespace
    train_corpus <- tm_map(train_corpus, stripWhitespace)

    # Create Document Term Matrix
    dtm_train <- DocumentTermMatrix(train_corpus)
    # Keep only terms whose sparsity is below 0.4
    dtm_train <- removeSparseTerms(dtm_train, 0.4)
    dtm_train_matrix <- as.matrix(dtm_train)

    # Append the target as the last column (cbind stores the factor as integer codes)
    dtm_train_matrix <- cbind(dtm_train_matrix, train_dat$highest_rating)
    colnames(dtm_train_matrix)[ncol(dtm_train_matrix)] <- "y"

    # Convert to a data frame so that caret's formula interface can use it
    training_set_ted_talk <- as.data.frame(dtm_train_matrix)
    training_set_ted_talk$y <- as.factor(training_set_ted_talk$y)

Now that we have our training dataset ready, we can train our model. I am using the caret package and its svmLinear3 method. svmLinear3 provides L2 regularization in SVM with Linear Kernel. Agreed, that's a lot of technical jargon, which I am purposely not explaining here because that's for another blog altogether. Meanwhile, I am going to leave some links for you to understand L2 regularization, and SVM with Linear Kernel.

    library(caret)

    review_ted_model <- train(y ~ ., data = training_set_ted_talk, method = "svmLinear3")

Preparing our test data. It's the same repetitive procedure.

    test_corpus <- VCorpus(VectorSource(test_dat$transcript))
    ## Removing punctuation
    test_corpus <- tm_map(test_corpus, content_transformer(removePunctuation))
    ## Removing numbers
    test_corpus <- tm_map(test_corpus, removeNumbers)
    ## Converting to lower case
    test_corpus <- tm_map(test_corpus, content_transformer(tolower))
    ## Removing stop words
    test_corpus <- tm_map(test_corpus, content_transformer(removeWords), stopwords("english"))
    ## Stemming
    test_corpus <- tm_map(test_corpus, stemDocument)
    ## Whitespace
    test_corpus <- tm_map(test_corpus, stripWhitespace)

    # Create Document Term Matrix
    dtm_test <- DocumentTermMatrix(test_corpus)
    dtm_test <- removeSparseTerms(dtm_test, 0.4)
    dtm_test_matrix <- as.matrix(dtm_test)

Model Accuracy and other metrics

I will now check the accuracy/performance of our model on the test data.

    # Build the prediction
    model_ted_talk_result <- predict(review_ted_model, newdata = dtm_test_matrix)

    # Put predictions and true ratings side by side (both as integer codes)
    check_accuracy <- as.data.frame(cbind(prediction = model_ted_talk_result,
                                          rating = test_dat$highest_rating))

    library(dplyr)

    # Offset applied by the author to line up the integer-coded predictions
    # with the rating codes; verify the alignment on your own data
    check_accuracy <- check_accuracy %>%
      mutate(prediction = as.integer(prediction) - 1)

    # 1 = correctly classified, 0 = misclassified
    check_accuracy$accuracy <- if_else(check_accuracy$prediction == check_accuracy$rating, 1, 0)
    round(prop.table(table(check_accuracy$accuracy)), 3)

    library(performanceEstimation)

    classificationMetrics(as.integer(test_dat$highest_rating), model_ted_talk_result)

    most_common_misclassified_ratings <- check_accuracy %>%
      filter(accuracy == 0) %>%
      group_by(rating) %>%
      summarise(Count = n()) %>%
      arrange(desc(Count)) %>%
      head(3)

    ## Most commonly misclassified ratings
    levels(train_dat$highest_rating)[most_common_misclassified_ratings$rating]

The model metrics are:

[Figure: Model metrics]

The top 3 most commonly misclassified ratings are: "Inspiring", "Informative", "Fascinating". You can read more about micro and macro F1 scores from here and here.

Final Remarks

In this article, we have discussed multi-class classification of text.
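As a small illustration of the micro and macro F1 scores mentioned with the model metrics, here is a minimal base R sketch. The function name f1_scores and the toy labels are hypothetical, made up for this example; they are not part of the original analysis and are not results from the TED data.

```r
# Hypothetical helper (not from the original post): per-class, macro and
# micro F1 computed from a confusion matrix, using base R only.
f1_scores <- function(actual, predicted) {
  # Give both vectors the same set of levels so the table is square
  lv <- union(levels(factor(actual)), levels(factor(predicted)))
  cm <- table(factor(actual, levels = lv), factor(predicted, levels = lv))

  tp <- diag(cm)            # correctly predicted, per class
  fp <- colSums(cm) - tp    # predicted as the class, but actually another
  fn <- rowSums(cm) - tp    # actually the class, but predicted as another

  # Per-class F1 = 2TP / (2TP + FP + FN); 0/0 cases are set to 0
  per_class <- 2 * tp / (2 * tp + fp + fn)
  per_class[is.nan(per_class)] <- 0

  list(
    per_class = per_class,
    macro = mean(per_class),  # unweighted mean over classes
    micro = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))  # pooled counts
  )
}

# Toy labels, for illustration only
actual    <- c("Inspiring", "Inspiring", "Informative", "Funny", "Funny", "Informative")
predicted <- c("Inspiring", "Informative", "Informative", "Funny", "Inspiring", "Informative")

res <- f1_scores(actual, predicted)
print(round(res$per_class, 3))
print(round(c(macro = res$macro, micro = res$micro), 3))
```

Macro F1 weights every class equally, so rare ratings count as much as common ones. Micro F1 pools the counts over all classes and, when each talk has exactly one label, equals plain accuracy.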
