My secret sauce to be in the top 2% of a Kaggle competition

So, let’s get right into it!

One of the most important aspects of building any supervised learning model on numeric data is understanding the features well. Looking at partial dependence plots of a model helps you understand how the model’s output changes with any feature. The problem with these plots, though, is that they are created from a trained model. If we could create these plots from the training data directly, they could help us understand the underlying data better. They tell us what a feature says about customers and how it will affect the model.

Note that “test” here doesn’t mean the competition’s test set; it is your local test/validation set, for which you know the target.

[Image: Comparison of feature trends in train and test]

Featexp calculates two metrics to display on these plots which help with gauging noisiness: trend correlation and trend changes. The feature below does not hold the same trend across train and test and hence has a low trend correlation of 85%.

It’s also important not to drop too many important features, as that might lead to a drop in performance. You also can’t identify these noisy features using feature importance, because a feature can be fairly important and still be very noisy! Using test data from a different time period works even better, because then you are also checking whether the feature’s trend holds over time.

The get_trend_stats() function in featexp returns a dataframe with the trend correlation and trend changes for each feature.

[Image: Dataframe returned by get_trend_stats()]

Let’s actually try dropping features with low trend correlation from our data and see how the results improve.

[Image: AUC for different feature selections using trend-correlation]

We can see that the higher the trend-correlation threshold used to drop features, the higher the leaderboard (LB) AUC.
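The trend-correlation idea above is easy to sketch in plain pandas. Below is a minimal, simplified version of the calculation — bin a feature by train-set quantiles and correlate the per-bin mean target between train and test. The binning choices and correlation measure here are assumptions for illustration, not featexp’s exact implementation.

```python
import numpy as np
import pandas as pd

def trend_correlation(train, test, feature, target, bins=10):
    """Bin `feature` by train-set quantiles, compute the mean target per
    bin in train and in test, and correlate the two binned trends.
    A noisy feature gives a low (or negative) correlation."""
    # Quantile bin edges are computed on train only and reused for test,
    # so both trends are measured over the same feature ranges.
    edges = np.unique(np.quantile(train[feature], np.linspace(0, 1, bins + 1)))
    train_trend = train.groupby(
        pd.cut(train[feature], edges, include_lowest=True), observed=False
    )[target].mean()
    test_trend = test.groupby(
        pd.cut(test[feature], edges, include_lowest=True), observed=False
    )[target].mean()
    # Align bins and drop any bin that is empty in either data set.
    both = pd.concat([train_trend, test_trend], axis=1, keys=["train", "test"]).dropna()
    return both["train"].corr(both["test"])
```

A stable feature (same feature–target relationship in both data sets) scores near 1; a feature whose relationship flips or breaks down in test scores much lower.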
Not dropping the important features further improves the LB AUC to 0.74. The whole code can be found in the featexp_demo notebook.

The insights you get by looking at these plots also help with creating better features. For example, based on the XGBoost model’s feature importance, DAYS_BIRTH is actually more important than EXT_SOURCE_1.

Looking at Featexp’s plots helps you catch bugs in complex feature engineering code in two ways:

1. Zero-variation features show up as only a single bin.
2. Data leakage from target to features leads to overfitting, and leaky features show high feature importance. Depending on what the feature is, this could be due to a bug, or the feature could genuinely be populated only for defaulters (in which case it should be dropped). Knowing what the problem is with a leaky feature leads to quicker debugging.

[Image: Understanding why a feature is leaky]

Since featexp calculates trend correlation between two data sets, it can easily be used for model monitoring: trend correlation helps you monitor whether anything has changed in a feature with respect to the training data.
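Both the feature selection and the monitoring use case boil down to filtering on trend correlation. Here is a hedged sketch: the column names "Feature" and "Trend_correlation" mirror the get_trend_stats() output described above, but are an assumption here, not a verified featexp API.

```python
import pandas as pd

def select_stable_features(stats, threshold=0.95):
    """Return the features whose train/test trend correlation clears
    `threshold`. Column names ("Feature", "Trend_correlation") are
    assumed to match the get_trend_stats() dataframe."""
    stable = stats.loc[stats["Trend_correlation"] >= threshold, "Feature"]
    return stable.tolist()
```

The same filter doubles as a monitoring alarm: recompute the stats with recent production data in place of the test set, and alert on any feature that falls below the threshold.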
