Boosting: Is It Always The Best Option?

Michael Grogan (MGCodesandStats), Feb 28

Gradient boosting has become quite a popular technique in the area of machine learning.

Given its reputation for achieving potentially higher accuracy than other models, it has become particularly popular as a “go-to” model for Kaggle competitions.

However, use of gradient boosting raises two questions:

1. Does this technique really outperform others consistently, irrespective of the data being examined?
2. Even if this is the case, are gradient boosting techniques always a wise choice?

To answer these questions, I decided to compare gradient boosting techniques with logistic regression by attempting to predict the diabetes outcome.

The dataset is available at the UCI Machine Learning Repository.

Essentially, the dataset provides us with several features that are used to predict the outcome variable (diabetes = 1, no diabetes = 0).
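For readers who want to follow along, a minimal sketch of how the data might be loaded is shown below; the filename diabetes.csv and the presence of a header row with an Outcome column are assumptions on my part rather than details from the original workflow.

import numpy as np
import pandas as pd

# Assumed local copy of the Pima Indians Diabetes dataset; the filename
# "diabetes.csv" and the header row are assumptions for illustration.
df = pd.read_csv("diabetes.csv")

# Features (x) and the binary outcome variable (y): 1 = diabetes, 0 = no diabetes
x = df.drop("Outcome", axis=1).values
y = df["Outcome"].values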

Firstly, feature extraction with the ExtraTreesClassifier was performed to identify the most important features for predicting the outcome variable.

Then, the following models were run:

1. Logistic Regression
2. Gradient Boosting Classifier
3. LightGBM Classifier
4. XGBoost Classifier
5. AdaBoost Classifier

Feature Extraction

Feature extraction is used to determine the most important features that influence the outcome variable, i.e. which features have the strongest correlation with diabetes incidence.

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> model = ExtraTreesClassifier()
>>> model.fit(x, y)
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=10,
                     n_jobs=None, oob_score=False, random_state=None,
                     verbose=0, warm_start=False)
>>> print(model.feature_importances_)
[0.10696279 0.25816011 0.09378777 0.09258844 0.06920807 0.11396286
 0.12328806 0.1420419 ]

From the feature extraction, features 0 (pregnancies), 1 (glucose), 5 (BMI), 6 (DiabetesPedigreeFunction), and 7 (Age) showed the highest scores in terms of feature importance, and these are the ones included in the models to predict the outcome variable.
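To make these scores easier to read, the importances can be paired with the feature names and ranked; the sketch below assumes the standard Pima column order, which matches the indices used here.

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Rank features from most to least important
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{:<26} {:.3f}".format(name, score))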

Therefore, these variables were combined into xnew using a NumPy column stack, and the data was partitioned into training and validation sets with train_test_split.

x0 = x[:,0]
x1 = x[:,1]
x5 = x[:,5]
x6 = x[:,6]
x7 = x[:,7]
xnew = np.column_stack((x0, x1, x5, x6, x7))
xnew

from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(xnew, y, random_state=0)

Logistic Regression vs. Boosting Classifiers

Having selected the relevant features and partitioned the data, a logistic regression was run alongside several boosting classifiers.

# Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression().fit(x_train, y_train)
logreg

# GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(x_train, y_train)

# LightGBM Classifier
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier(learning_rate=0.001, num_leaves=65, n_estimators=100)
lgb_model.fit(x_train, y_train)

# XGBoost
import xgboost as xgb
xgb_model = xgb.XGBClassifier(learning_rate=0.001, max_depth=1, n_estimators=100)
xgb_model.fit(x_train, y_train)

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=100,
                             algorithm="SAMME.R",
                             learning_rate=0.001)
ada_clf.fit(x_train, y_train)

As can be observed, n_estimators was set to 100 for each boosting model, while the learning rate was set to 0.001.

Machine Learning Mastery offers more detail on how to implement gradient boosting techniques, but in this case the learning rate (or shrinkage parameter) is set below 0.1 for better generalization error, while n_estimators (the number of trees) is set to 100, in line with the recommended range of 100 to 500 outlined in the paper "Greedy Function Approximation: A Gradient Boosting Machine".
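For what it's worth, these two hyperparameters could also be tuned rather than fixed by rule of thumb; the sketch below uses scikit-learn's GridSearchCV, with a parameter grid that is purely illustrative rather than taken from the original analysis.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative (assumed) grid over the shrinkage parameter and number of trees
param_grid = {"learning_rate": [0.001, 0.01, 0.1],
              "n_estimators": [100, 300, 500]}

grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)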

When these models were run, the following training and validation set scores were obtained:

>>> print("Accuracy on training set: {:.3f}".format(logreg.score(x_train, y_train)))
Accuracy on training set: 0.766
>>> print("Accuracy on validation set: {:.3f}".format(logreg.score(x_val, y_val)))
Accuracy on validation set: 0.797

>>> print("Accuracy on training set: {:.3f}".format(gbrt.score(x_train, y_train)))
Accuracy on training set: 0.896
>>> print("Accuracy on validation set: {:.3f}".format(gbrt.score(x_val, y_val)))
Accuracy on validation set: 0.792

>>> print("Accuracy on training set: {:.3f}".format(lgb_model.score(x_train, y_train)))
Accuracy on training set: 0.642
>>> print("Accuracy on validation set: {:.3f}".format(lgb_model.score(x_val, y_val)))
Accuracy on validation set: 0.677

>>> print("Accuracy on training set: {:.3f}".format(xgb_model.score(x_train, y_train)))
Accuracy on training set: 0.748
>>> print("Accuracy on validation set: {:.3f}".format(xgb_model.score(x_val, y_val)))
Accuracy on validation set: 0.750

>>> print("Accuracy on training set: {:.3f}".format(ada_clf.score(x_train, y_train)))
Accuracy on training set: 0.748
>>> print("Accuracy on validation set: {:.3f}".format(ada_clf.score(x_val, y_val)))
Accuracy on validation set: 0.750

From the above results, two things are evident:

1. Only the GradientBoostingClassifier yields a validation accuracy similar to that of the logistic regression; all the other boosting models show slightly lower validation accuracy.
2. Moreover, the accuracy of the logistic regression on the training set is slightly lower than on the validation set, implying that overfitting is less of an issue for the logistic regression than for the gradient boosting models.
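A single train/validation split can be noisy on a dataset of this size, so one way to sanity-check the comparison is k-fold cross-validation; a sketch for two of the models is shown below, with the choice of five folds being my own assumption rather than part of the original analysis.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Compare mean cross-validated accuracy rather than a single split
for name, clf in [("Logistic Regression", LogisticRegression()),
                  ("Gradient Boosting", GradientBoostingClassifier(random_state=0))]:
    scores = cross_val_score(clf, xnew, y, cv=5, scoring="accuracy")
    print("{}: mean accuracy {:.3f} (std {:.3f})".format(name, scores.mean(), scores.std()))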

Conclusion

Boosting models have become "black box" models of sorts, and are increasingly relied upon for increased accuracy.

However, these models don’t necessarily give the best accuracy in all cases (as we have seen here), and the issue of overfitting also must be considered.

It was observed that the training accuracy of the GradientBoostingClassifier was significantly higher than its validation accuracy, and this indicates overfitting.

Boosting works on the premise of combining several weak models (e.g. many decision trees) in order to increase accuracy, which is why these are often referred to as ensemble models.
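One way to see this ensembling at work is to track validation accuracy as trees are added to the GradientBoostingClassifier fitted earlier; the sketch below uses scikit-learn's staged_predict, and the particular checkpoints printed are my own choice.

from sklearn.metrics import accuracy_score

# Validation accuracy after each additional tree in the fitted ensemble
staged_acc = [accuracy_score(y_val, y_pred)
              for y_pred in gbrt.staged_predict(x_val)]
print("After 1 tree:      {:.3f}".format(staged_acc[0]))
print("After 50 trees:    {:.3f}".format(staged_acc[49]))
print("After all {} trees: {:.3f}".format(len(staged_acc), staged_acc[-1]))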

While boosting models can be advantageous depending on the data one is working with, they do come with an overfitting risk and should not simply be relied upon by default without considering the data in question and whether other models could prove more suitable.
