Imbalanced Class Sizes and Classification Models: A Cautionary Tale

```python
# Check the rebalanced class counts, then fit the baseline classifier
# on each rebalanced training set and inspect its confusion matrix
print('Classes in rebalanced training set with ADASYN:', dict(zip(yvals_ads, counts_ads)))

y_pred_smt = fit_logistic_regression_classifier(X_smoted, y_smoted)
plot_confusion_matrix(ytest, y_pred_smt)

y_pred_ads = fit_logistic_regression_classifier(X_adasyn, y_adasyn)
plot_confusion_matrix(ytest, y_pred_ads)
```

3. Gridsearch on balanced classes

Since the baseline model with ADASYN oversampling performed best in terms of recall, I performed a gridsearch on the ADASYN-resampled training set to find the parameters that would further optimize model performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}  # l1 = lasso, l2 = ridge
# liblinear supports both l1 and l2 penalties (the default lbfgs solver does not support l1)
logreg = LogisticRegression(random_state=88, solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=5, scoring='recall')
logreg_cv.fit(X_adasyn, y_adasyn)
print("Tuned hyperparameters (best parameters):", logreg_cv.best_params_)
```

The logistic regression model with a C parameter of 0.001 and an L2 regularization penalty had an improved recall score of 0.65. This means that the model was able to catch 65 percent of new users who would book Airbnbs internationally.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_cv = logreg_cv.predict(X_test_scaled)

# Use accuracy_score here: logreg_cv.score() would report recall, not accuracy,
# because the gridsearch was set up with scoring='recall'
print('accuracy = ', round(accuracy_score(ytest, y_pred_cv), 2),
      'precision = ', round(precision_score(ytest, y_pred_cv), 2),
      'recall = ', round(recall_score(ytest, y_pred_cv), 2),
      'f1_score = ', round(f1_score(ytest, y_pred_cv), 2))

plot_confusion_matrix(ytest, y_pred_cv)
```

While balanced classes and hyperparameter tuning yielded significant improvements to the model's recall score, model precision remained quite low, at 0.3. This means that only 30% of users classified as international travellers are actually booking Airbnbs internationally.
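To make those percentages concrete, precision and recall can be read straight off the confusion matrix counts. Here is a minimal sketch, assuming the `ytest` and `y_pred_cv` arrays from above:

```python
from sklearn.metrics import confusion_matrix

# For binary labels, ravel() unpacks the 2x2 matrix as
# (true negatives, false positives, false negatives, true positives)
tn, fp, fn, tp = confusion_matrix(ytest, y_pred_cv).ravel()

precision = tp / (tp + fp)  # share of predicted international bookers who actually book internationally
recall = tp / (tp + fn)     # share of actual international bookers the model catches
print(f'precision = {precision:.2f}, recall = {recall:.2f}')
```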

In a business setting, a model like this might be used to inform targeted ads for vacation homes based on predicted booking destination.

This means that 70 percent of users receiving suggestions for, say, homes overlooking the Eiffel Tower will in fact be looking to travel domestically.

Such mis-targeting would not only prove irrelevant to this group, but failure to disseminate relevant ads to the U.S.A./Canada group could mean missed revenue over time.

Now that I’ve corrected for the overestimation of model performance by oversampling the minority class, next steps might include additional feature engineering to tease out more signal and fitting alternative classification algorithms, such as K-Nearest Neighbors or a Random Forest classifier, as sketched below.
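For example, a Random Forest could be dropped in with minimal changes, reusing the ADASYN-resampled training data and the same evaluation metrics. This is a rough sketch, not a tuned model; the n_estimators value here is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Fit on the ADASYN-resampled training data, evaluate on the untouched test set
rf = RandomForestClassifier(n_estimators=200, random_state=88)
rf.fit(X_adasyn, y_adasyn)
y_pred_rf = rf.predict(X_test_scaled)

print('recall = ', round(recall_score(ytest, y_pred_rf), 2),
      'precision = ', round(precision_score(ytest, y_pred_rf), 2))
```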

Conclusion

In this example, model accuracy declined significantly once I rebalanced the target class sizes. Even after hyperparameter tuning using gridsearch cross-validation, the logistic regression model was 10 percentage points less accurate than the baseline model with imbalanced classes.

This example demonstrates the importance of taking class imbalance into account to avoid overestimating the accuracy of classification models.

I have also outlined, with working code, three techniques for re-balancing classes through over-sampling (Random Over Sampling, SMOTE, and ADASYN).
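All three samplers share the same fit_resample interface in imbalanced-learn, so they can be swapped freely. A minimal recap, assuming the scaled training split is named `X_train_scaled`/`ytrain` (mirroring the test-set names used above):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# Resample only the training split, never the test set,
# so the evaluation reflects the true class distribution
X_ros, y_ros = RandomOverSampler(random_state=88).fit_resample(X_train_scaled, ytrain)
X_smoted, y_smoted = SMOTE(random_state=88).fit_resample(X_train_scaled, ytrain)
X_adasyn, y_adasyn = ADASYN(random_state=88).fit_resample(X_train_scaled, ytrain)
```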

Further information on each technique can be found in the Imbalanced-Learn documentation.

