Automate Stacking In Python

It is now time to define a few first-level models for our stacked generalization.

This step definitely deserves its own article, but for the sake of simplicity we are going to use three models: a KNN classifier, a Random Forest classifier, and an XGBoost classifier.

models = [
    KNeighborsClassifier(n_neighbors=5, n_jobs=-1),

    RandomForestClassifier(random_state=0, n_jobs=-1,
                           n_estimators=100, max_depth=3),

    XGBClassifier(random_state=0, n_jobs=-1, learning_rate=0.1,
                  n_estimators=100, max_depth=3)
]

These parameters were not tuned prior to setting them, as the purpose of this article is testing the package.

If you were to optimize performance, you should not just copy and paste these.
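
If you do want to tune them, one straightforward option is a quick grid search per first-level model before handing it to the stack. Here is a minimal sketch using sklearn's GridSearchCV; the parameter grid is purely illustrative and not something I validated for this dataset:

from sklearn.model_selection import GridSearchCV

# Illustrative grid only; widen or narrow it depending on your compute budget.
param_grid = {'n_estimators': [100, 300], 'max_depth': [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs=-1),
                      param_grid, scoring='accuracy', cv=4)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)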

Taking the next part of code from the documentation, we are essentially performing the GIF's first part, using the first-level models to make predictions:

S_train, S_test = stacking(models,
                           X_train, y_train, X_test,
                           regression=False,
                           mode='oof_pred_bag',
                           needs_proba=False,
                           save_dir=None,
                           metric=accuracy_score,
                           n_folds=4,
                           stratified=True,
                           shuffle=True,
                           random_state=0,
                           verbose=2)

The stacking function takes several inputs:

- models: the first-level models we defined earlier
- X_train, y_train, X_test: our data
- regression: Boolean indicating whether we want to use the function for regression. In our case set to False, since this is a classification
- mode: using the earlier described out-of-fold predictions made during cross-validation
- needs_proba: Boolean indicating whether you need the probabilities of class labels
- save_dir: directory to save the result to (None means nothing is saved)
- metric: what evaluation metric to use (we imported the accuracy_score in the beginning)
- n_folds: how many folds to use for cross-validation
- stratified: whether to use stratified cross-validation
- shuffle: whether to shuffle the data
- random_state: setting a random state for reproducibility
- verbose: 2 here refers to printing all info

Doing so, we get the following output:

task:       [classification]
n_classes:  [3]
metric:     [accuracy_score]
mode:       [oof_pred_bag]
n_models:   [4]

model 0:    [KNeighborsClassifier]
    fold 0: [0.72972973]
    fold 1: [0.61111111]
    fold 2: [0.62857143]
    fold 3: [0.76470588]
    ----
    MEAN:   [0.68352954] + [0.06517070]
    FULL:   [0.68309859]

model 1:    [ExtraTreesClassifier]
    fold 0: [0.97297297]
    fold 1: [1.00000000]
    fold 2: [0.94285714]
    fold 3: [1.00000000]
    ----
    MEAN:   [0.97895753] + [0.02358296]
    FULL:   [0.97887324]

model 2:    [RandomForestClassifier]
    fold 0: [1.00000000]
    fold 1: [1.00000000]
    fold 2: [0.94285714]
    fold 3: [1.00000000]
    ----
    MEAN:   [0.98571429] + [0.02474358]
    FULL:   [0.98591549]

model 3:    [XGBClassifier]
    fold 0: [1.00000000]
    fold 1: [0.97222222]
    fold 2: [0.91428571]
    fold 3: [0.97058824]
    ----
    MEAN:   [0.96427404] + [0.03113768]
    FULL:   [0.96478873]

Again, referring to the GIF, all that's left to do now is fit the second-level model(s) of our choice on our predictions to make our final predictions.

In our case, we are going to use an XGBoost Classifier.

This step is not significantly different from a regular fit-and-predict in sklearn, except that instead of using X_train to train our model, we are using our predictions S_train.
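
If you want to see what those predictions look like: S_train and S_test are plain NumPy arrays with one column per first-level model, where S_train holds the out-of-fold predictions for the training set and S_test holds the corresponding test-set predictions aggregated across folds. A quick, optional sanity check:

# Each column of S_train / S_test holds one first-level model's predictions.
print(S_train.shape, S_test.shape)
print(S_train[:5])

With that checked, we can fit the second-level model: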

model = XGBClassifier(random_state=0, n_jobs=-1, learning_rate=0.1,
                      n_estimators=100, max_depth=3)

model = model.fit(S_train, y_train)

y_pred = model.predict(S_test)

print('Final prediction score: [%.8f]' % accuracy_score(y_test, y_pred))

Output:

Final prediction score: [0.97222222]

Conclusion

Using vecstack's stacking automation, we've managed to predict the correct wine cultivar with an accuracy of approximately 97.2%!

As you can see, the API does not collide with the sklearn API and could, therefore, provide a helpful tool when trying to speed up your stacking workflow.
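
To illustrate that point: because S_train and S_test are ordinary arrays, nothing ties us to XGBoost for the second level, and any other sklearn estimator could be dropped in. A purely illustrative sketch with a logistic regression meta-model (not something I benchmarked here):

from sklearn.linear_model import LogisticRegression

# Swap the meta-model; the stacked features S_train / S_test stay the same.
meta = LogisticRegression(max_iter=1000, random_state=0)
meta.fit(S_train, y_train)
print('Logistic regression meta-model accuracy: [%.8f]'
      % accuracy_score(y_test, meta.predict(S_test)))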

As always, if you have any feedback or found mistakes, please don’t hesitate to reach out to me.

References:

[1] David H. Wolpert, Stacked Generalization (1992), Neural Networks

[2] Igor Ivanov, Vecstack (2016), GitHub

[3] M. Forina et al., Wine Data Set (1991), UCI Machine Learning Repository
