Everything You Need to Know About Decision TreesIntro to decision trees, random forests, bagging, boosting, and the underlying theoryMarco PeixeiroBlockedUnblockFollowFollowingJan 16Photo by Jay Mantri on UnsplashTree-based methods can be used for regression or classification.

They involve segmenting the prediction space into a number of simple regions.

The set of splitting rules can be summarized in a tree, hence the name decision tree methods.

A single decision tree is often not as performant as linear regression, logistic regression, LDA, etc.

However, by introducing bagging, random forests, and boosting, it can result in dramatic improvements in prediction accuracy at the expense of some loss in interpretation.

In this post, we introduce everything you need to know about decision trees, bagging, random forests, and boosting.

It will be a long read, but it will be worth it!Here’s Bob Ross painting a treeBasics of decision treesRegression treesBefore getting to the dirty theory, we need some basic terminology.

Trees are drawn upside down.

The final regions are termed leaves.

The points inside the tree where a split occurs is an interval node.

Finally, segments that connect nodes are branches.

Regression tree schematicTo create a regression tree:Divide the predictor space into J distinct and non-overlapping regionsFor every observation that falls in a region, predict the mean of the response value in that regionEach region is split to minimize the RSS.

To do so, it takes a top-down greedy approach also called recursive binary splitting.

Why top-down?Because all observations are in a single region before the first split.

Why a greedy approach?Because the best split occurs at a particular step, rather than looking ahead and making a split that will result in a better prediction in a future step.

Mathematically, we define the pair of half-planes as:and we seek j and s to minimize:However, this may lead to overfitting.

Pruning the tree will result in a smaller subtree that we can validate with cross-validation.

Schematic of an unpruned treeClassification treeA classification tree is very similar to a regression tree.

However, we cannot use the mean value of the response, so we now predict the most commonly occurring class in a region.

Of course, RSS cannot be used as a criterion.

Instead, each split is done to minimize the classification error rate.

The classification error rate is simply the fraction of training observations in a region that do not belong to the most common class.

Classification error rate formulaUnfortunately, this is not sensitive enough for tree-growing.

In practice, two other methods are used.

There is the Gini index:Gini indexThis is a measure of total variance across all classes.

As you can see, the Gini index will be small if the proportion is close to 0 or 1, so it is a good measure of node purity.

A similar rationale is applied to the other method called cross-entropy:Cross-entropyNow that we have seen how a basic decision tree works, let’s see how we can improve its performance!Bagging, random forests and boostingBaggingWe have previously seen that bootstrap can compute the standard deviation of any quantity of interest.

For decision trees, the variance is very high.

Therefore, with bootstrap aggregation or bagging, we can reduce the variance and increase the performance of a decision tree.

Bagging involves repeatedly taking samples from a dataset.

This generates B different bootstrap training sets.

Then, we train on all bootstrapped training sets to get a prediction for each set, and we average the predictions.

Mathematically, the average prediction is:Applying this to decision trees, it means that that we can construct a high number of trees which will have high variance and low bias.

Then, we can average their predictions to reduce the variance to improve the performance of the decision trees.

Random forestsRandom forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.

Like in bagging, multiple decision trees are built.

However, at each split, a random sample of m predictors is chosen from all p predictors.

The split is allowed to use only one of the m predictors, and typically,In other words, at each split, the algorithm is not allowed to consider a majority of the available predictors!Why?Suppose that there is one very strong predictor in the dataset, along with other moderately strong predictors.

Then, in the collection of bagged trees, they will all use this strong predictor in the top split.

Consequently, all of the bagged trees will be very similar, and averaging their predictions will not reduce variance, since the predictions would be highly correlated.

Random forests overcome this problem by forcing each split to only consider a subset of predictors which effectively decorrelates the trees.

Of course, if m is equal to p, then it is just like bagging.

Usually, the square root of p gives the best results as shown below.

Classification error as a function of the number of trees.

Each line represents the number of predictors available at each split.

BoostingBoosting works in a similar way to bagging, but the trees are grown sequentially: each tree uses information from the previously grown trees.

This means that the algorithm learns slowly.

Each tree is fit to the residuals from the model rather than to the target variable.

Hence, each tree is small and will slowly improve predictions in areas where it does not perform well.

There are three tuning parameters in boosting:number of tree (B): unlike bagging and random forests, boosting can overfit if B is too large.

Use cross-validation to choose the right amount of trees.

shrinkage parameter (alpha): a small positive number that controls the learning rate of boosting.

It is typically set to 0.

01 or 0.

001.

number of splits in each tree (d): it controls the complexity of the boosted ensemble.

Usually, a single split (d = 1) works well.

It is also called the interaction depth.

Classification error as a function of the number of trees.

Each line represents a different interaction depth.

As you can see above, an interaction depth of 1 seems to give the best results.

Wow, that was a lot of information!.You now know everything about decision trees, random forests, boosting and bagging.

Decision trees can be used for both regression and classification, and when combined to boosting, bagging, they perform extremely well.

In a future post, I will show how to implement decision trees in a real-world situation using Python!Leave any questions or suggestions in the comment section!.I look forward to reading them!Cheers!.. More details