Breaking the curse of small datasets in Machine Learning: Part 1

From the above figures, we can note that k-NN is highly influenced by the amount of available data, and that more data can make the model more consistent and accurate.

(c) Decision Trees: Similar to linear regression and k-NN, decision tree performance is also impacted by the amount of data.

Fig 8: Difference in tree splitting due to the size of the data

A decision tree is also a non-parametric model and tries to best fit the underlying distribution of the data. Splits are performed on feature values with the aim of creating disparate classes at the child level. Since the model tries to best fit the available training data, the quantity of data directly determines the split points and final classes. From the above figure, we can clearly observe that the split points and final class predictions are greatly influenced by the size of the dataset. More data helps in finding optimal split points and avoiding overfitting.

How to address the problem of less data?

Fig 9: Basic implications of fewer data and possible approaches and techniques to solve it

The above figure tries to capture the core issues faced while dealing with small datasets, along with possible approaches and techniques to address them. In this part we will focus only on the techniques used in traditional machine learning; the rest will be discussed in part 2 of the blog.

a) Change the loss function: For classification problems, we often use cross-entropy loss and rarely use mean absolute error or mean squared error to train and optimize our model. With unbalanced data, the model becomes biased towards the majority class, since it has a larger influence on the final loss value, and our model becomes less useful. In such scenarios, we can add weights to the losses corresponding to different classes to even out this data bias. For example, if we have two classes with data in the ratio 4:1, we can apply weights in the ratio 1:4 to the loss function calculation to balance out the classes.
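As a minimal sketch of this weighting idea, the snippet below builds a toy dataset with a 4:1 class imbalance (the sizes and classifier choice are illustrative, not from the original article) and shows the resulting 1:4 weights:

```python
# A minimal sketch of class weighting on a toy 4:1 imbalanced dataset.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Toy labels with a 4:1 class imbalance: 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.random.RandomState(0).randn(100, 3)  # dummy features

# 'balanced' weights are inversely proportional to class frequencies:
# n_samples / (n_classes * bincount(y)) -> [0.625, 2.5], i.e. a 1:4 ratio.
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))  # {0: 0.625, 1: 2.5}

# Equivalently, let the estimator perform the same calculation internally.
clf = LogisticRegression(class_weight='balanced').fit(X, y)
```

Most scikit-learn classifiers accept `class_weight` directly, so in practice the explicit calculation is rarely needed.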
This technique helps us easily mitigate the issue of unbalanced data and improves model generalization across classes. Libraries in both R and Python can help in assigning weights to classes during loss calculation and optimization; Scikit-learn, for example, has a convenient utility function to calculate the weights from class frequencies. We can also skip that calculation altogether by passing class_weight='balanced', which performs the same computation internally, or feed explicit class weights as per our requirements. For more details refer to Scikit-learn’s documentation.

b) Anomaly/Change detection: In cases of highly imbalanced datasets such as fraud or machine failure, it is worth pondering whether such examples can be treated as anomalies. If the given problem fits the anomaly framing, we can use models such as OneClassSVM, clustering methods, or Gaussian anomaly detection methods. These techniques require a shift in thinking: we treat the minority class as the outlier class, which might help us find new ways to separate and classify. Change detection is similar to anomaly detection, except that we look for a change or difference instead of an anomaly; these might be changes in the behavior of a user as observed by usage patterns or bank transactions. Please refer to Scikit-learn’s documentation to learn how to implement anomaly detection.

Fig 10: Over and Undersampling depiction (Source)

c) Up-sample or Down-sample: Since unbalanced data inherently weights the majority class differently from the minority class, one solution is to make the data balanced. This can be done either by increasing the frequency of the minority class or by reducing the frequency of the majority class through random or clustered sampling techniques. The choice of over-sampling vs. under-sampling, and of random vs. clustered sampling, is determined by the business context and the data size.
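The anomaly-detection framing from (b) above can be sketched with scikit-learn's OneClassSVM. The synthetic data, the `nu` value, and the outlier range below are illustrative assumptions, not values from the article:

```python
# A sketch of the anomaly-detection framing with scikit-learn's OneClassSVM.
# In practice the model is fit on "normal" (majority-class) samples only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_normal = rng.randn(200, 2)                            # majority / "normal" behaviour
X_outliers = rng.uniform(low=5, high=8, size=(10, 2))   # rare, far-away points

# nu bounds the expected fraction of outliers in the training data
# (0.05 is an illustrative assumption about the outlier rate).
model = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_normal)

# predict() returns +1 for inliers and -1 for outliers; the far-away
# points above should be flagged as -1.
print(model.predict(X_outliers))
```

The same fit/predict pattern applies to scikit-learn's other outlier detectors (e.g. IsolationForest or EllipticEnvelope), so the choice of detector can be swapped without changing the surrounding code.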
Generally, upsampling is preferred when the overall data size is small, while downsampling is useful when we have a large amount of data. Similarly, the choice between random and clustered sampling is determined by how well the data is distributed. For a detailed understanding, please refer to the following blog. Resampling can be done easily with the help of the imblearn package.

d) Generate Synthetic Data: Although upsampling or downsampling helps in making the data balanced, duplicated data increases the chances of overfitting. Another approach is to generate synthetic data from the minority class. Synthetic Minority Over-sampling Technique (SMOTE) and Modified-SMOTE (M-SMOTE) are two such techniques. Simply put, SMOTE takes minority class data points and creates new data points that lie on the straight line joining two nearby minority points. To do this, the algorithm calculates the distance between two data points in the feature space, multiplies it by a random number between 0 and 1, and places the new data point at this distance from one of the two points. Note that the number of nearest neighbors considered for data generation is also a hyperparameter and can be changed as required.

Fig 11: SMOTE in action with k=3 (Source)

M-SMOTE is a modified version of SMOTE that also takes the underlying distribution of the minority class into consideration. The algorithm classifies minority-class samples into three distinct groups: security/safe samples, border samples, and latent noise samples. This is done by calculating the distances between minority-class samples and the rest of the training data. Unlike SMOTE, the algorithm randomly selects a data point from the k nearest neighbors for security samples, selects the nearest neighbor for border samples, and does nothing for latent noise samples.
For a detailed understanding, refer to the blog.

e) Ensembling Techniques: The idea of aggregating multiple weak learners or different models has shown great results when dealing with imbalanced datasets. Both bagging and boosting techniques have shown strong results across a variety of problems and should be explored along with the methods discussed above.
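As a hedged sketch of the bagging and boosting ideas above, the snippet below fits one ensemble of each kind on a synthetic imbalanced dataset (the dataset, models, and metric are illustrative choices, not from the original article):

```python
# Bagging and boosting ensembles on a synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Roughly 9:1 imbalanced binary classification data.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Bagging: aggregate many trees fit on bootstrap samples of the training data.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Boosting: fit trees sequentially, each correcting the previous ones' errors.
boosting = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Balanced accuracy averages per-class recall, so the minority class
# counts as much as the majority class in the score.
for name, model in [('bagging', bagging), ('boosting', boosting)]:
    score = balanced_accuracy_score(y_te, model.predict(X_te))
    print(name, round(score, 3))
```

These ensembles can also be combined with the earlier techniques, e.g. by resampling the training data or passing class weights to the base learners.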
