XGBoost is not black magic

If the percentage of missing values in a sample increases, the performance of the built-in strategy can worsen considerably. Granted, the default direction is the best possible choice given that the sample has reached the current position, but there is no guarantee that the current position is the best possible situation considering all the features of the current sample. Overcoming this limitation means dealing with a sample considering all its features at the same time, and tackling directly the possible simultaneous presence of more than one missing value in the same realisation.

Imputing missing values and improving performance

In order to beat the XGBoost built-in strategy we have to consider all the features of a sample at the same time and somehow deal with the possible presence of more than one missing value in it. A good example of such an approach is K-Nearest Neighbours (KNN) with an ad-hoc distance metric that properly handles missing values. Generally speaking, KNN is a well-known algorithm that retrieves the K (e.g. 3, 10, 50, …) closest samples to the sample under consideration. It can be used both to classify an unseen input and to impute missing values, in both cases assigning to the target the mean or median value computed over the K nearest neighbours. This kind of method requires a distance metric (or, correspondingly, a similarity measure) to rank all the samples in the training set and retrieve the K most similar ones.

To outperform the XGBoost built-in default strategy we need two things:

- a distance metric that takes missing values into account (thanks to this post by AirBnb for the inspiration); a NumPy sketch of such a metric is shown below
- to normalise the dataset, so that distances obtained by summing up differences among features with different domains are meaningful (this is not strictly required by XGBoost, but it is needed for KNN imputation!)

The missing value of a feature is imputed using the median value of that feature over the K closest samples; in the specific case where none of the K retrieved neighbours has a non-missing value for that feature, the median of the whole column is used instead.
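The original post includes a NumPy implementation of the employed distance metric. Here is a minimal sketch along those lines, assuming an absolute-difference metric with a fixed λ penalty (lam) added for every feature that is missing in at least one of the two samples, followed by a brute-force nearest-neighbour median imputation with a column-median fallback; the exact metric and implementation used for the experiments may differ.

```python
import numpy as np


def sparsity_aware_distance(a, b, lam=1.0):
    """Distance between two samples that may contain NaNs.

    Features observed in both samples contribute their absolute
    difference; every feature missing in at least one of the two
    samples adds a fixed penalty lam instead (assumed form).
    """
    both_present = ~np.isnan(a) & ~np.isnan(b)
    dist = np.abs(a[both_present] - b[both_present]).sum()
    return dist + lam * np.count_nonzero(~both_present)


def knn_impute(X, k=10, lam=1.0):
    """Impute NaNs with the median of the k nearest neighbours.

    Falls back to the column-wise median when none of the k
    neighbours has an observed value for that feature. Assumes X
    has already been normalised (e.g. min-max scaled) so that
    differences across features are comparable.
    """
    X = np.asarray(X, dtype=float)
    X_imputed = X.copy()
    col_medians = np.nanmedian(X, axis=0)

    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Distances from the current sample to every other sample.
        dists = np.array([
            sparsity_aware_distance(row, other, lam) if j != i else np.inf
            for j, other in enumerate(X)
        ])
        neighbours = X[np.argsort(dists)[:k]]
        for f in np.where(missing)[0]:
            observed = neighbours[~np.isnan(neighbours[:, f]), f]
            # Median of the neighbours' values, column median as a fallback.
            X_imputed[i, f] = np.median(observed) if observed.size else col_medians[f]
    return X_imputed
```

Note that this brute-force version computes the distance from each incomplete sample to every other sample, so it scales quadratically with the number of rows; it is meant to illustrate the idea rather than to be efficient.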
Experimental results

I have run some tests using three well-known datasets freely available in scikit-learn (two classification and one regression). Performance has been measured via k-fold cross-validation, comparing three different imputation strategies (a minimal sketch of the comparison is given at the end of the post):

- the default one built into the XGBoost algorithm
- a simple column-wise median imputation
- a KNN as described in the previous paragraph

For the KNN case I have plotted the best performance obtained for the considered percentage of missing values, with respect both to k (the number of neighbours to consider) and λ (a constant added to the distance for every feature that is missing in at least one of the two samples).

Figure 1

Imputing missing values with a sparsity-aware KNN consistently outperformed the other two methods. The extent of the difference is of course dataset dependent. A first naive conclusion: the lower the quality of the dataset, the greater the influence of a better imputation strategy. As Figure 2 shows, the built-in strategy ends up performing close to a trivial column-wise median imputation.

Figure 2

It is quite interesting to see how k and λ influence the final results, and how the introduction of a penalisation factor makes sense not just on paper. A distance metric that not only discards missing values but also adds a weight for each of them is crucial for the performance obtained with this method, even if its value is not directly correlated with the increasing percentage of missing values.

Figure 3

Tests have shown that, as a rule of thumb, the higher the quantity of missing values, the higher the number of neighbours to consider for a better imputation. Once again, a very intuitive conclusion.
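To make the comparison concrete, here is a minimal sketch of the evaluation loop under the assumption that values are removed completely at random. The dataset (breast cancer), the missing percentage, and the hyperparameters are illustrative and not necessarily the ones used for the figures above; knn_impute refers to the sketch shown earlier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

rng = np.random.default_rng(42)

# One example dataset; the same loop can be repeated on the other datasets.
X, y = load_breast_cancer(return_X_y=True)

# Knock out a given percentage of entries completely at random.
missing_pct = 0.3
X_missing = X.astype(float).copy()
X_missing[rng.random(X.shape) < missing_pct] = np.nan

# 1) XGBoost built-in handling of missing values (NaNs passed straight in).
built_in = cross_val_score(XGBClassifier(n_estimators=200), X_missing, y, cv=5).mean()

# 2) Simple column-wise median imputation, fitted inside each CV fold.
median_pipe = make_pipeline(SimpleImputer(strategy="median"), XGBClassifier(n_estimators=200))
median_score = cross_val_score(median_pipe, X_missing, y, cv=5).mean()

# 3) Sparsity-aware KNN imputation: plug in knn_impute from the sketch above,
#    after scaling the columns so that distances are comparable, e.g.
# X_knn = knn_impute(scaled_X_missing, k=10, lam=1.0)
# knn_score = cross_val_score(XGBClassifier(n_estimators=200), X_knn, y, cv=5).mean()

print(f"built-in: {built_in:.3f}  median imputation: {median_score:.3f}")
```

Repeating this loop over a grid of missing percentages, k values and λ values gives the kind of curves shown in the figures.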
