A Gentle Introduction to the XGBoost Library

In this article we will look at what XGBoost is, how it works, when to use it, and when not to use it. Two common terms used in ML are bagging and boosting.

Bagging: an approach where you take random samples of the data, build a learner on each sample, and take a simple mean of their predictions (or predicted probabilities) to get the final result.

Boosting: a similar idea, except that the selection of samples is made more intelligently.

We subsequently give more and more weight to hard-to-classify observations.

XGBoost is currently one of the hottest libraries in supervised machine learning.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.

It implements machine learning algorithms under the Gradient Boosting framework.

XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems with billions of examples and beyond.

What makes XGBoost so popular?

Speed and performance: the core algorithm is parallelizable, and the library has both a linear model solver and tree learning algorithms. What makes it fast is its capacity to do parallel computation on a single machine, and it delivers state-of-the-art performance in many ML tasks.

Features of XGBoost

Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP.

It is generally over 10 times faster than the classical gbm.

Input Type: it takes several types of input data:

Dense Matrix: R's dense matrix, i.e. matrix;
Sparse Matrix: R's sparse matrix, i.e. Matrix::dgCMatrix;
Data File: local data files;
xgb.DMatrix: its own class (recommended).

Sparsity: it accepts sparse input for both the tree booster and the linear booster, and is optimized for sparse input.

Customization: it supports customized objective functions and evaluation functions.

XGBoost is an optimized distributed gradient boosting library, originally written in C++; after winning a competition it started being adopted by the machine learning community, and it now has APIs in several languages: Python, R, Julia, Scala, and Java.

How does XGBoost work?

XGBoost is a popular and efficient open-source implementation of the gradient boosted trees algorithm.

Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

When using gradient boosting for regression, the weak learners are regression trees, and each regression tree maps an input data point to one of its leaves, which contains a continuous score.

XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions).

The training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees that are then combined with previous trees to make the final prediction.

It's called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. (Source: AWS Documentation)
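To make the residual-fitting idea concrete, here is a minimal from-scratch sketch of gradient boosting for regression with squared-error loss, using small scikit-learn trees as the weak learners. The dataset, tree depth, learning rate and number of rounds are illustrative assumptions; this shows the general idea, not XGBoost's actual implementation.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean), then repeatedly fit a small
# tree to the current residuals and add a damped version of its output.
prediction = np.full(len(y), y.mean())
trees = []
for _ in range(n_rounds):
    residuals = y - prediction                     # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # ensemble prediction improves each round
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))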

To dive deeper, you can also read the paper XGBoost: A Scalable Tree Boosting System.

Let's see a basic example of how to use XGBoost.

Note: make sure you have XGBoost installed. If not, refer to the installation guide (Anaconda users can follow the Anaconda installation guide).
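A quick way to check that the library is available in your environment is to import it and print its version; the install commands in the comments are the usual ones, though your setup may differ.

import xgboost as xgb

# If this import fails, install the library first, e.g.:
#   pip install xgboost
#   conda install -c conda-forge xgboost   (for Anaconda users)
print(xgb.__version__)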

XGBoost Parameters

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters, and task parameters.

General parameters relate to which booster we are using to do boosting, commonly a tree or linear model.

Booster parameters depend on which booster you have chosen.

Learning task parameters decide on the learning scenario; for example, regression tasks may use different parameters than ranking tasks.

For more details, refer to the XGBoost Parameters documentation.
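As an illustration of how the three groups fit together, here is a small, hedged example of a parameter dictionary for the native API; the specific values are arbitrary choices for demonstration.

params = {
    "booster": "gbtree",              # general parameter: which booster to use
    "max_depth": 4,                   # booster parameter: controls the tree booster
    "eta": 0.1,                       # booster parameter: learning rate
    "objective": "binary:logistic",   # learning task parameter: the learning scenario
    "eval_metric": "logloss",         # learning task parameter: metric to evaluate
}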

Parameters (these are the arguments of XGBoost's training function, xgb.train):

1. params (dict): Booster params.

2. dtrain (DMatrix): Data to be trained.

3. num_boost_round (int): Number of boosting iterations.

4. evals (list of pairs (DMatrix, string)): List of items to be evaluated during training; this allows the user to watch performance on the validation set.

5. obj (function): Customized objective function.

6. feval (function): Customized evaluation function.

7. maximize (bool): Whether to maximize feval.

8. early_stopping_rounds (int): Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training. Requires at least one item in evals; if there is more than one, the last one is used. Returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. (Use bst.best_ntree_limit to get the correct value if num_parallel_tree and/or num_class appears in the parameters.)

9. evals_result (dict): This dictionary stores the evaluation results of all the items in the watchlist. Example: with a watchlist containing [(dtest,'eval'), (dtrain,'train')] and a parameter containing ('eval_metric': 'logloss'), evals_result returns {'train': {'logloss': ['0.48253', '0.35953']}, 'eval': {'logloss': ['0.480385', '0.357756']}}.

10. verbose_eval (bool or int): Requires at least one item in evals. If verbose_eval is True, the evaluation metric on the validation set is printed at each boosting stage. If verbose_eval is an integer, the evaluation metric on the validation set is printed at every verbose_eval boosting stages. The last boosting stage, or the boosting stage found using early_stopping_rounds, is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.

11. learning_rates (list or function, deprecated; use the callback API instead): List of learning rates for each boosting round, or a customized function that calculates eta in terms of the current round number and the total number of boosting rounds (e.g. to yield learning rate decay).

12. xgb_model (file name of a stored xgb model or a 'Booster' instance): Xgb model to be loaded before training (allows training continuation).

13. callbacks (list of callback functions): List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks from the xgb.callback module, e.g. [xgb.callback.reset_learning_rate(custom_rates)].

[Note: I included the main code only to demonstrate how XGBoost is used. To see it from the beginning, refer to GitHub (the link is provided at the end of the article).]
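Before the full worked example below, here is a minimal sketch of how a few of these arguments (evals, early_stopping_rounds, evals_result) fit together in a call to xgb.train. The synthetic data and parameter values are illustrative assumptions, not part of the article's example.

import numpy as np
import xgboost as xgb

# Toy data just to make the call runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1, "eval_metric": "logloss"}
evals_result = {}

bst = xgb.train(
    params,
    dtrain,
    num_boost_round=200,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=10,    # stop if 'valid' logloss does not improve for 10 rounds
    evals_result=evals_result,   # per-round metrics end up in this dict
    verbose_eval=False,
)

print(bst.best_iteration, evals_result["valid"]["logloss"][-1])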

Code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [10]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Out[10]:
(435759, 54)
(145253, 54)
(435759,)
(145253,)

In [11]:
xgb_cl = xgb.XGBClassifier(n_estimators=15, learning_rate=0.5, max_delta_step=5)

In [12]:
xgb_cl.fit(X_train, y_train)

Out[12]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bytree=1,
              gamma=0, learning_rate=0.5, max_delta_step=5, max_depth=3, min_child_weight=1,
              missing=None, n_estimators=15, n_jobs=1, nthread=None, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=True, subsample=1)

In [13]:
preds = xgb_cl.predict(X_test)

In [14]:
# Compute the accuracy: accuracy
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.735888

Source: https://xgboost.readthedocs.io/en/latest/tutorials/model.html

Because XGBoost is usually used with trees as the base learner, we need to understand what a decision tree is and how it works.

Decision trees are supervised learning algorithms used for both classification and regression tasks; in this first part of our decision tree tutorial we will concentrate on classification.

Decision trees belong to the family of information-based learning algorithms, which use different measures of information gain for learning.

We can use decision trees for problems with both continuous and categorical input and target features.

The main idea of decision trees is to find the descriptive features which contain the most "information" regarding the target feature, and then split the dataset along the values of these features such that the target feature values of the resulting sub-datasets are as pure as possible. The descriptive feature which splits the target feature most purely is said to be the most informative one.

This process of finding the "most informative" feature is repeated until we reach a stopping criterion, at which point we end up in so-called leaf nodes.

The leaf nodes contain the predictions we will make for new query instances presented to our trained model.

This is possible since the model has kind of learned the underlying structure of the training data and hence can, given some assumptions, make predictions about the target feature value (class) of unseen query instances.

A decision tree mainly contains a root node, interior nodes, and leaf nodes which are then connected by branches.
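As a rough illustration of the "most informative feature" idea, here is a small sketch that computes entropy and the information gain of a single binary split; the toy labels and candidate splits are assumptions for demonstration, and real implementations may use other purity measures (e.g. Gini impurity).

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, mask):
    # Gain from splitting `labels` into the groups mask==True and mask==False
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
feature = np.array([1.2, 0.5, 0.7, 2.3, 2.9, 3.1, 2.2, 0.9])

print(information_gain(y, feature > 1.5))   # a very pure split -> higher gain
print(information_gain(y, feature > 0.6))   # a less pure split -> lower gain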

Decision tree as a base learner

A base learner is an individual learning algorithm in an ensemble algorithm. A decision tree is composed of a series of binary questions (yes or no), and prediction happens at the leaves. Decision trees are constructed iteratively, one decision at a time, until a stopping criterion is met.

During construction, the tree is built one split at a time. The way a split is selected (that is, which feature to split on and where in that feature's range of values to split) can vary, but it involves choosing a split point that segregates the target values well, so that each branch becomes increasingly dominated by a single category, until all or nearly all values within a given split are exclusively of one category or another.

Using this process, each leaf of the decision tree will have a single category in the majority, or will be exclusively of one category.

Individual decision trees generally have low bias and high variance.

That is, they are good at learning relationships within any data you train them on, but they tend to overfit that training data and usually generalize poorly to new data.

XGBoost uses a slightly different kind of decision tree, called a classification and regression tree (CART).

Whereas the leaf nodes of the decision tree described above always contain decision values, CART trees contain a real-valued score in each leaf, regardless of whether they are used for classification or regression.

The real-valued scores can then be thresholded to convert them into categories for classification problems, if necessary.
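A short sketch of what "thresholding a real-valued score" means in a binary classification setting: the leaf scores summed across trees are passed through the logistic function and then cut at 0.5. The particular scores and the 0.5 cut-off here are assumptions for illustration.

import numpy as np

# Suppose the trees in an ensemble contributed these leaf scores for one example
leaf_scores = np.array([0.4, -0.1, 0.3, 0.25])

margin = leaf_scores.sum()                       # raw real-valued score (the "margin")
probability = 1.0 / (1.0 + np.exp(-margin))      # logistic transform, as in binary:logistic
predicted_class = int(probability > 0.5)         # threshold to get a hard class label

print(margin, probability, predicted_class)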

In [15]:
from sklearn.tree import DecisionTreeClassifier

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4 == y_test)) / y_test.shape[0]
print("accuracy:", accuracy)

Out[15]:
accuracy: 0.7004036040377615

Now that we've reviewed the basics of a decision tree, let's talk about the core concept that gives XGBoost its state-of-the-art performance.

Boosting isn’t a specific machine learning algorithm, but the concept can be applied to a set of machine learning models.

So it is a kind of meta-algorithm.

Specifically, it is an ensemble meta-algorithm primarily used to reduce any given single learner’s variance and to convert many weak learners into an arbitrarily strong learner.

Now, what are weak learners and strong learners? A weak learner is any machine learning algorithm that performs just slightly better than chance.

For example, a decision tree that can predict some outcome slightly more accurately than pure randomness would be considered a weak learner.

The principal insight that allows XGBoost to work is the fact that you can use boosting to convert a collection of weak learners into a strong learner.

A strong learner, by contrast, is an algorithm that can be tuned to achieve arbitrarily good performance on some supervised learning problem.

How is boosting accomplished?

By iteratively learning a set of weak models on subsets of the data at hand and weighting each of their predictions according to each weak learner's performance. We then combine all of the weak learners' predictions, multiplied by their weights, to obtain a single final weighted prediction that is much better than any of the individual predictions themselves.
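One classic way to make this weighting concrete is the AdaBoost scheme, sketched below; note that this illustrates the general boosting idea rather than what XGBoost itself does (XGBoost fits each new tree to gradients of the loss instead). The dataset, number of rounds and stump depth are assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_signed = np.where(y == 1, 1, -1)          # AdaBoost works with labels in {-1, +1}

sample_weights = np.full(len(y), 1.0 / len(y))
learners, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y_signed, sample_weight=sample_weights)
    pred = stump.predict(X)

    err = np.sum(sample_weights * (pred != y_signed)) / np.sum(sample_weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # weight of this weak learner

    # Misclassified points get more weight, so the next stump focuses on them
    sample_weights *= np.exp(-alpha * y_signed * pred)
    sample_weights /= sample_weights.sum()

    learners.append(stump)
    alphas.append(alpha)

# Final strong prediction: weighted vote of all weak learners
final = np.sign(sum(a * l.predict(X) for a, l in zip(alphas, learners)))
print("training accuracy:", np.mean(final == y_signed))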

The explanation below is taken from the XGBoost documentation.

We classify the members of a family into different leaves and assign them the score on the corresponding leaf.

A CART is a bit different from decision trees, in which the leaf only contains decision values.

In CART, a real score is associated with each of the leaves, which gives us richer interpretations that go beyond classification.

This also allows for a principled, unified approach to optimization, as we will see in a later part of this tutorial.

Usually, a single tree is not strong enough to be used in practice.

What is actually used is the ensemble model, which sums the prediction of multiple trees together.

Here is an example of a tree ensemble of two trees.

The prediction scores of each individual tree are summed up to get the final score.

If you look at the example, an important fact is that the two trees try to complement each other.

Mathematically, we can write our model in the form

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$

where K is the number of trees, each f_k is a function in the functional space F, and F is the set of all possible CARTs.

The objective function to be optimized is given by

$$\text{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \omega(f_k)$$

where l is the training loss and ω measures the complexity of each tree.

Now here comes a trick question: what is the model used in random forests? Tree ensembles! So random forests and boosted trees are really the same model; the difference arises from how we train them.

This means that, if you write a predictive service for tree ensembles, you only need to write one and it should work for both random forests and gradient boosted trees.

(See Treelite for an actual example.) This is one example of why the elements of supervised learning rock.

Since we will be working with XGBoost's learning API for model evaluation, it is a good idea to briefly show how model evaluation using cross-validation works with the learning API, which is different from the scikit-learn API.

Cross-validation is a robust method for estimating the performance of a machine learning model on unseen data: it generates many non-overlapping train/test splits of your training data and reports the average test performance across all of the splits.
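Here is a quick sketch of that idea using scikit-learn's KFold, training an XGBClassifier on each split and averaging the held-out accuracy; the synthetic data and fold count are assumptions for illustration (the article's own examples use XGBoost's cv() method instead).

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, random_state=42)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # held-out accuracy per fold

print("mean CV accuracy:", np.mean(scores))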

If you plan to use XGBoost on a dataset which has categorical features you may want to consider applying some encoding (like one-hot encoding) to such features before training the model.

Also, if you have missing values such as NA in the dataset, you may or may not need to treat them separately, because XGBoost is capable of handling missing values internally.
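For example, a categorical column can be one-hot encoded with pandas before building the DMatrix, while numeric NaNs can simply be left in place for XGBoost to handle internally; the tiny DataFrame below is an assumption for illustration.

import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({
    "soil_type": ["clay", "sand", "clay", "loam"],   # categorical feature
    "elevation": [2596.0, np.nan, 2804.0, 2785.0],   # numeric feature with a missing value
    "target":    [0, 1, 0, 1],
})

# One-hot encode the categorical column; leave the NaN alone
features = pd.get_dummies(df.drop(columns="target"), columns=["soil_type"], dtype=float)
dmat = xgb.DMatrix(features, label=df["target"])     # XGBoost learns a default direction for missing values

print(features.columns.tolist())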

In [26]:
from sklearn.metrics import mean_squared_error

X, y = data.iloc[:, :-1], data.iloc[:, -1]
data_dmatrix = xgb.DMatrix(data=X, label=y)

XGBoost's hyperparameters

At this point, before building the model, you should be aware of the tuning parameters that XGBoost provides.

Well, there is a plethora of tuning parameters for tree-based learners in XGBoost, and you can read all about them here.

But the most common ones that you should know are:

learning_rate: step size shrinkage used to prevent overfitting. The range is [0, 1].

max_depth: determines how deeply each tree is allowed to grow during any boosting round.

subsample: percentage of samples used per tree. A low value can lead to underfitting.

colsample_bytree: percentage of features used per tree. A high value can lead to overfitting.

n_estimators: number of trees you want to build.

objective: determines the loss function to be used, e.g. reg:linear for regression problems, reg:logistic for classification problems that output only a decision, and binary:logistic for classification problems that output a probability.

XGBoost also supports regularization parameters to penalize models as they become more complex and reduce them to simple (parsimonious) models.

gamma: controls whether a given node will split based on the expected reduction in loss after the split.

A higher value leads to fewer splits.

Supported only for tree-based learners.

alpha: L1 regularization on leaf weights.

A large value leads to more regularization.

lambda: L2 regularization on leaf weights and is smoother than L1 regularization.

It's also worth mentioning that although we are using trees as the base learners here, you can also use XGBoost's relatively less popular linear base learners, as well as another tree learner known as dart.

All you have to do is set the booster parameter to either gbtree (default), gblinear or dart.
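For instance, switching base learners with the scikit-learn wrapper is just a matter of the booster argument; the other values in this sketch are arbitrary.

import xgboost as xgb

tree_model = xgb.XGBRegressor(booster="gbtree", n_estimators=100)      # default: gradient boosted trees
linear_model = xgb.XGBRegressor(booster="gblinear", n_estimators=100)  # linear base learners
dart_model = xgb.XGBRegressor(booster="dart", n_estimators=100)        # trees with dropout (DART)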

Now I will create the train and test sets for cross-validation of the results, using the train_test_split function from sklearn's model_selection module, with test_size equal to 20% of the data.

Also, to maintain reproducibility of the results, a random_state is assigned.

Code:

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [28]:
xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)

In [29]:
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

In [30]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

Out[30]:
RMSE: 10.868649

k-fold Cross-Validation using XGBoost

In order to build more robust models, it is common to do k-fold cross-validation, where all the entries in the original training dataset are used for both training as well as for validation.

Also, each entry is used for validation just once.

XGBoost supports k-fold cross validation via the cv() method.

All you have to do is specify the nfold parameter, which is the number of cross-validation sets you want to build.

Also, it supports many other parameters, like:

num_boost_round: denotes the number of trees you build (analogous to n_estimators).

metrics: tells the evaluation metrics to be watched during CV.

as_pandas: whether to return the results in a pandas DataFrame.

early_stopping_rounds: finishes training of the model early if the hold-out metric (“rmse” in our case) does not improve for a given number of rounds.

seed: for reproducibility of results.

This time I will create a hyper-parameter dictionary, params, which holds all the hyper-parameters and their values as key-value pairs, but I will exclude n_estimators from the dictionary because I will use num_boost_round instead.

I will use these parameters to build a 3-fold cross validation model by invoking XGBoost’s cv() method and store the results in a cv_results DataFrame.

Note that here we are using the DMatrix object created before.

In [31]:
params = {"objective": "reg:linear", 'colsample_bytree': 0.3, 'learning_rate': 0.1,
          'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3, num_boost_round=50,
                    early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)

Out[31]:
[20:29:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[20:29:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[20:29:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2

In [32]:
cv_results.head()

Out[32]:

In [33]:
print((cv_results["test-rmse-mean"]).tail(1))

Out[33]:
49    3.988943
Name: test-rmse-mean, dtype: float64

You can see that the RMSE for the price prediction has come down compared to last time, to around 4.03 per $1000.

You can reach an even lower RMSE with a different set of hyper-parameters.

When Should I use XGBoost?

You should consider using XGBoost for any supervised machine learning task that fits the following criteria:

You have a large number of training examples ("large" can vary; here I mean a dataset that has few features and at least 2,000 examples). In general, though, what matters is that the number of features in your training set is smaller than the number of examples you have (number of features < number of training samples).

XGBoost tends to do well when you have a mixture of categorical and numerical features, or when you have just numeric features.

When Should I not use XGBoost?

XGBoost is not ideally suited for image recognition, computer vision, or natural language processing and understanding problems; those kinds of problems can be tackled much better with deep learning approaches.

Never go for XGBoost when you have a very small training set.

I hope you found this article useful and that you now feel more confident about applying XGBoost to solve a data science problem.

Did you like this article? Would you like to share some other hacks you use while building XGBoost models? Please feel free to drop a note in the comments below and I'll be glad to discuss.

Source Code: GitHub

Dataset Used: Forest Cover Type

Originally published at TheMenYouWantToBe&Co. on January 19, 2019.
