Shortcoming of Under-sampling Algorithms: CCMUT and E-CCMUT

Navoneel Chakrabarty
Dec 7

In one of my previous articles, “Under-sampling : A Performance Booster on Imbalanced Data” (towardsdatascience.com), I applied the Cluster Centroid based Majority Under-sampling Technique (CCMUT) on Adult Census Data and showed an improvement in Model Performance w.r.t. the State-of-the-Art Model, “A Statistical Approach to Adult Census Income Level Prediction” [1].

The Validation Set of the model created by under-sampling had 3,151 instances of mixed labels (0 and 1). On that Validation Set, the model correctly classified 2,861 out of 3,151 instances (90.78% accuracy), while the state-of-the-art, when tested on these same 3,151 instances, gave 2,589 correct predictions (82.16% accuracy).

And when tested on such instances (difficult instances), the model gets completely confused and mostly predicts wrong labels.

Ways to remove the drawback:

Under-sampling should be done by a very small percentage (1–10%). But this may fail if it does not bring any performance improvement on its own Validation Set.

Under-sampling followed by Random Selection: Here, after under-sampling by less than 50%, random selection of data-points can be done among the under-sampled instances.

So Dataset Creators/Compilers (UCI/Kaggle) can use CCMUT/E-CCMUT for under-sampling such instances, making the dataset favourable for Machine Learning Model Development.

This is my last article on “Under-sampling”. In my following articles, I will be coming up with Implementation of Machine Learning Algorithms in Python from SCRATCH.

MOTIVATION FOR THIS ARTICLE

I would like to thank Victor Deplasse for motivating this article, for giving my article “Under-sampling : A Performance Booster on Imbalanced Data” a thorough read, and for questioning the consistency of the Under-sampling Algorithms CCMUT and E-CCMUT, as I was convinced that they were perfect Performance Boosters.
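
To make the remedies listed under “Ways to remove the drawback” more concrete, here is a minimal Python sketch. It rests on assumptions that this article does not spell out: that CCMUT ranks majority-class instances by Euclidean distance to the majority-class centroid and removes the farthest ones, and that “random selection” means putting a random share of the removed instances back into the training data. The function names `ccmut_undersample` and `restore_random`, the `fraction` and `share` parameters, and the toy data are all hypothetical and only serve the illustration.

```python
import numpy as np


def ccmut_undersample(X, y, majority_label, fraction):
    """Drop the `fraction` of majority-class rows farthest from the class centroid.

    Assumed reading of CCMUT. Returns (kept_idx, removed_idx), both sorted
    index arrays into X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == majority_label)
    centroid = X[maj_idx].mean(axis=0)
    dist = np.linalg.norm(X[maj_idx] - centroid, axis=1)
    n_remove = int(round(fraction * len(maj_idx)))
    # Sort ascending by distance; the last n_remove positions are the farthest rows.
    order = np.argsort(dist, kind="stable")
    removed_idx = np.sort(maj_idx[order[len(maj_idx) - n_remove:]])
    kept_idx = np.setdiff1d(np.arange(len(y)), removed_idx)
    return kept_idx, removed_idx


def restore_random(kept_idx, removed_idx, share, seed=0):
    """Randomly put back `share` of the removed rows (assumed reading of remedy 2)."""
    rng = np.random.default_rng(seed)
    n_back = int(round(share * len(removed_idx)))
    back = rng.choice(removed_idx, size=n_back, replace=False)
    return np.sort(np.concatenate([kept_idx, back]))


# Toy data: five majority rows (label 0), one of them an outlier, and two minority rows.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [100.0], [50.0], [51.0]])
y_demo = np.array([0, 0, 0, 0, 0, 1, 1])
kept_demo, removed_demo = ccmut_undersample(X_demo, y_demo, majority_label=0, fraction=0.2)
print("removed:", removed_demo)  # the outlier at index 4
```

Keeping `fraction` small (0.01 to 0.10) corresponds to the first remedy. For the second, one would call `ccmut_undersample` with a `fraction` below 0.5 and then `restore_random` so that some of the removed, harder instances are still seen during training.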
