Upgrade Your Image Classifier with Balanced Data

PCA Output

From the PCA summary, it is clear that the first 25 principal components explain more than 80% of the variance in the data.

PCA Summary

The t-SNE visualization after PCA is strikingly better, but most of the classes are still mapped close together rather than into clearly separated clusters.

t-SNE PC 1–25

The same preprocessing steps and the orthogonal rotation learned by PCA are then applied to the validation and test data.

Step 5: Stratification

In this step, we deal with the class imbalance in the training data. SMOTE is an oversampling approach in which the minority class is over-sampled by creating "synthetic" examples rather than by over-sampling with replacement [1]. Combining synthetic oversampling of the minority classes (SMOTE) with random undersampling of the majority classes yields a balanced training set that can be used to build a better-performing classifier. This is evident from the t-SNE visualization of the new training data below.

t-SNE Balanced Data

Step 6: Classification

We build a classifier using the prepared training data. The model's confusion matrix shows that the overall accuracy has improved to nearly 60%.

Confusion matrix — Classifier trained using the processed data

Step 7: Hyperparameter tuning

By tuning the complexity parameter (cp) we can reduce the relative error, as shown in the plot below.

Variation in relative error with complexity parameter

Step 8: Build the final model

The last step is to build the final model with the tuned parameter (the cp value with the minimum cross-validated relative error) and evaluate it on the test data. The confusion matrix statistics show that the classifier's performance has improved dramatically, with an overall accuracy greater than 90%. Other performance metrics such as sensitivity and specificity are also better.

Confusion Matrix — Final model

Other performance metrics comparison for various classes

Conclusion

Preprocessing, dimensionality reduction, and balancing the dataset improved the performance of the decision tree
classifier remarkably, from 27% to 91%. Visualizing high-dimensional data with t-SNE gives a better idea of the distribution of the classes, which helps significantly in preprocessing the data to build a better classifier.

References

[1] N. Chawla, K. Bowyer, L. Hall and W. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal of Artificial Intelligence Research, 16, pp. 321–357.

[2] Guest Blog, How to Handle Imbalanced Classification Problems in Machine Learning (2017), [Blog] Analytics Vidhya.

[3] L. van der Maaten and G. Hinton, Visualizing Data using t-SNE (2008), Journal of Machine Learning Research, 9, pp. 2579–2605.

[4] Z. Zheng, A Benchmark for Classifier Learning (1993), In Australian Joint Conference on Artificial Intelligence, World Scientific, pp. 281–286.

[5] M. G.
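As a supplementary sketch, the PCA step described above (keeping enough components to explain at least 80% of the variance, then applying the same centering and orthogonal rotation to the validation and test sets) can be written with plain NumPy. The arrays below are synthetic stand-ins for the article's image features, not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real feature matrices (assumption: synthetic
# data; the article's actual features are not reproduced here).
X_train = rng.normal(size=(200, 64))
X_val = rng.normal(size=(50, 64))
X_test = rng.normal(size=(50, 64))

# Center with the *training* mean only, then use SVD to obtain the
# principal-component rotation.
mu = X_train.mean(axis=0)
_, s, Vt = np.linalg.svd(X_train - mu, full_matrices=False)

# Keep enough components to explain >= 80% of the variance
# (the article found 25 components sufficed for its data).
var_ratio = s**2 / np.sum(s**2)
k = int(np.searchsorted(np.cumsum(var_ratio), 0.80)) + 1
W = Vt[:k].T  # orthogonal rotation learned from the training set

# Apply the same centering and rotation to validation and test data,
# exactly as the article does before classification.
Z_train = (X_train - mu) @ W
Z_val = (X_val - mu) @ W
Z_test = (X_test - mu) @ W
```

The key point the article makes implicitly: the mean and rotation are estimated on the training data only, and merely applied to the other splits, so no information leaks from validation or test into the preprocessing.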
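The balancing scheme of Step 5 can also be sketched in a few lines of NumPy, following the interpolation rule from Chawla et al. [1]: each synthetic minority sample lies on the segment between a minority point and one of its k nearest minority neighbours. The `smote` and `undersample` helpers here are illustrative, not the article's code (which in practice would use a library implementation).

```python
import numpy as np

def smote(X_min, n_new, rng, k=5):
    """Create n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority neighbours [1]."""
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]         # indices of k nearest neighbours
    base = rng.integers(0, len(X_min), n_new)       # random seed samples
    nbr = nn[base, rng.integers(0, k, n_new)]       # one random neighbour each
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

def undersample(X_maj, n_keep, rng):
    """Randomly keep n_keep majority samples, without replacement."""
    return X_maj[rng.choice(len(X_maj), size=n_keep, replace=False)]

# Toy imbalanced data: 20 minority vs. 300 majority samples.
rng = np.random.default_rng(0)
X_min = rng.normal(loc=2.0, size=(20, 5))
X_maj = rng.normal(loc=0.0, size=(300, 5))

# Balance both classes at 100 samples each: oversample the minority
# with SMOTE and randomly undersample the majority.
X_min_bal = np.vstack([X_min, smote(X_min, 80, rng)])
X_maj_bal = undersample(X_maj, 100, rng)
```

Because each synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the minority already occupies, which is why the balanced t-SNE plot shows denser, not displaced, minority clusters.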
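Steps 7–8 tune rpart's complexity parameter (cp) and refit at the cp with the minimum cross-validated relative error; the article appears to use R. As a hedged Python sketch, the closest scikit-learn analogue is cost-complexity pruning via `ccp_alpha`, selected by cross-validation. The dataset here is synthetic (`make_classification`), not the article's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data as a stand-in for the article's features.
X, y = make_classification(n_samples=600, n_features=25, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Enumerate the candidate pruning strengths (analogue of rpart's cp grid).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(path.ccp_alphas.clip(min=0))
if len(alphas) > 1:
    alphas = alphas[:-1]  # the largest alpha prunes the tree to a single node

# Pick the alpha with the best cross-validated accuracy, mirroring
# "the cp value with minimum cross-validated relative error".
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X_tr, y_tr, cv=5).mean() for a in alphas]
best = alphas[int(np.argmax(scores))]

# Build the final model with the tuned parameter and evaluate on test data.
final = DecisionTreeClassifier(ccp_alpha=best, random_state=0).fit(X_tr, y_tr)
test_accuracy = final.score(X_te, y_te)
```

Larger `ccp_alpha` (like larger cp) prunes harder and yields a simpler tree; the cross-validated curve over `alphas` plays the role of the relative-error plot in Step 7.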
