Decision Tree for Better Usage

The answer is ‘it depends’.

Instead, we should focus on how they differ.

From Introduction to Data Mining

We learnt that the maximum of Gini is 0.5, which is different from that of Entropy (1.0 for two classes). Namely, Entropy rises more sharply than Gini, which means it can penalize an impure node harder.

Another difference is that Misclassification Error is not smooth like the other two.

For example, an increase on the x-axis from 0.1 to 0.2 adds the same impurity as an increase from 0.3 to 0.4; the relationship is linear.

By contrast, Gini sees the impurity at 0.4 as almost the same as the impurity at 0.5, and Entropy does so even more. This means Gini and Entropy are stricter against impurity.

And which one to use depends on the situation.
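To make the comparison concrete, here is a quick sketch in R (not from the original post) that computes the three measures for a two-class node over a range of class proportions:

# Impurity measures for a two-class node, where p is the proportion of class 1
gini <- function(p) 1 - (p^2 + (1 - p)^2)
entropy <- function(p) ifelse(p == 0 | p == 1, 0, -(p * log2(p) + (1 - p) * log2(1 - p)))
error <- function(p) 1 - pmax(p, 1 - p)

p <- seq(0, 1, by = 0.1)
round(data.frame(p, gini = gini(p), entropy = entropy(p), error = error(p)), 3)
# Gini peaks at 0.5 while Entropy peaks at 1.0; Error grows linearly up to p = 0.5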

Continuous Features

We started with categorical features, and this probably didn't strike anyone as strange.

But Decision Tree can handle continuous features gracefully.

Let’s take a look at the example.

We have a series of values: 10, 20, …, 100, each of which is labeled as True or False.

The first thing to do is to order them; then we can pick a split point between any two adjacent values.

Let’s say we split at 45.

Theoretically, there are infinitely many points at which continuous values can be split.

There are four values less than 45: three of them are False and one is True. The same counting applies to the group on the right, the values greater than 45.

Now we can calculate the Gini for each group. Of course, we need to weight the two groups by their sizes to get the final Gini.

We do this to all split points possible between values.

As seen in the table, the lowest Gini is at the split point 35, where the left group is perfectly pure (all False) and the right group has 7 examples, 5 of which are True.

The calculation seems tiring, but as you slide through the split points, the True and False counts change only one at a time.

This means that each calculation only needs to make a slight change for the next calculation (machines do that for us, anyway).

In conclusion, continuous features are handled very similarly to categorical features, just with more calculation.
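As a rough sketch of this procedure, the search over split points could look like the R snippet below. The True/False labels here are assumed so that they match the counts described above; they are not the original table.

# Hypothetical labels for the values 10, 20, ..., 100, consistent with the example
x <- seq(10, 100, by = 10)
y <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

gini.node <- function(labels) {
  p <- mean(labels)                      # proportion of True in the group
  1 - (p^2 + (1 - p)^2)
}

splits <- head(x, -1) + diff(x) / 2      # candidate points: 15, 25, ..., 95
weighted <- sapply(splits, function(s) {
  left <- y[x < s]
  right <- y[x >= s]
  (length(left) * gini.node(left) + length(right) * gini.node(right)) / length(y)
})
data.frame(split = splits, weighted.gini = round(weighted, 3))
# The minimum lands at the split point 35, where the left group is pure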

Non-linearity and Regression

Straight to the point, Decision Tree is a non-linear model.

If an example is classified by a feature node, how far it satisfies the condition doesn't matter, only whether it does.

In the figure, there are some extreme values like -205 or 15482.

However, they all end up in the same node as any other number that satisfies the condition.

They don’t get any special titles such as ‘Super Yes’ or ‘Double No’.

At this point, Decision Tree seems to be unable to do regression.

But yes, it can.

Let’s look at some analogies.

We all know that an ANOVA model can do regression, in which the averages over categorical predictors represent certain groups of samples.

A similar idea can be applied to Decision Tree.

Some statistic such as the mean or median of the examples in a node can represent that node.

From Scikit-Learn Website

First, it is obvious from the box-like prediction lines that Decision Tree is non-linear.

One more takeaway here is that it shows how Decision Tree can over-fit examples.

‘Max-depth’ controls how complex a tree can be built.

We can see that a tree with max-depth set to 5 tries hard to fit all the far-off examples at the cost of a much more complex model.
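As a small illustration of tree regression and over-fitting (synthetic sine-like data in the spirit of the Scikit-Learn figure, not the original code), each leaf predicts the mean of its examples, and the deeper tree chases the noise:

# Regression tree on noisy sine data; the deeper tree over-fits the outliers
library(rpart)

set.seed(1)
x <- sort(runif(80, 0, 5))
y <- sin(x) + rnorm(80, sd = 0.2)
dat <- data.frame(x = x, y = y)

shallow <- rpart(y ~ x, data = dat, control = rpart.control(maxdepth = 2))
deep <- rpart(y ~ x, data = dat, control = rpart.control(maxdepth = 5, minsplit = 2, cp = 0))

plot(x, y, col = 'grey40')
lines(x, predict(shallow, dat), type = 's', col = 'blue')   # step-like, non-linear fit
lines(x, predict(deep, dat), type = 's', col = 'red')       # wiggles to fit far-off points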

Greedy Algorithm

Decision Tree is a greedy algorithm, which finds the best solution at each step.

In other words, it may not find the global best solution.

When there are multiple features, Decision Tree loops through the features to start with the best one that splits the target classes in the purest manner (lowest Gini or most information gain).

It keeps doing this until there are no more features or further splitting is meaningless.

Why does this sometimes fail to find the true best solution? To understand what actually happens in the 'greedy' manner, assume that we are going from Point A to Point B.

In the greedy setting, standing on Point A, we don't see the red way-points yet; we only see the blue points.

So we choose the bottom route, because its first leg is shorter than the top one.

But in the next step we discover the red point, and so on, ending up with a longer trip.

The best choice made at each step does not guarantee the global best.
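To make the analogy concrete, here is a tiny sketch with made-up leg distances (hypothetical numbers, not taken from the figure):

# Two routes from A to B; the greedy choice looks only at the first leg
legs <- list(top = c(5, 4), bottom = c(3, 8))

greedy.choice <- names(which.min(sapply(legs, `[`, 1)))   # picks 'bottom' (3 < 5)
optimal.choice <- names(which.min(sapply(legs, sum)))     # 'top' wins overall (9 < 11)
c(greedy = greedy.choice, optimal = optimal.choice)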

The same can happen in Decision Tree, because it doesn't try all the possible combinations of splits.

Imagine there are hundreds of features, if not thousands, categorical and continuous mixed together, each with different values and ranges.

Trying every possible combination would be prohibitively expensive.

Knowing that Decision Tree is greedy is important for understanding why advanced methods like Bagging work.

Tendency to Over-fit

Like every other model, Decision Tree has an issue with over-fitting.

It could be a bit worse than other models in a way.

An extreme example is an ID feature.

We would never want to use an ID as a feature, but let's consider it as one for the sake of understanding.

The tree will keep trying to split as long as any feature remains to split on. With an ID feature, this becomes a multi-way split where each ID gets its own branch, and each node holds only one example to decide on, since IDs are unique.

Now, it becomes a look-up table.

This is merely a bunch of hard-coded 'if statements'.

The problem arises when it sees a new example.

Then the model falls apart, because it hasn't learned any general function; it has only memorized cases.

If it were a binary-split Decision Tree, it would grow as deep as the number of examples.

Realistically, we would never use an ID as a feature, but the same thing can still happen if a feature has so many distinct values that it can draw many subtle boundaries, or if we let the tree grow as deep as it needs to fit every example.

There are also multiple ways to prevent this.

Pruning is one of them; it keeps a tree from growing too deep.

“Pre-pruning” controls the number of examples each node must have to split further.

“Post-pruning” lets the tree expand fully, then finds and removes over-fitting nodes.
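As a hedged sketch of both ideas with the rpart library (the data frame dat and target y below are placeholders, not the article's data):

# Pre-pruning: stop the tree early with minsplit / maxdepth / cp
library(rpart)
pre <- rpart(y ~ ., data = dat,
             control = rpart.control(minsplit = 20, maxdepth = 4, cp = 0.01))

# Post-pruning: grow a large tree first, then cut it back at the best complexity value
full <- rpart(y ~ ., data = dat, control = rpart.control(minsplit = 2, cp = 0))
best.cp <- full$cptable[which.min(full$cptable[, 'xerror']), 'CP']
post <- prune(full, cp = best.cp)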

In a more sophisticated way, we can consider Ensemble methods which let multiple learners make individual decisions and put them together for the final decision.

Bagging and Random Forest

Bootstrap aggregating, so-called Bagging, is an Ensemble model in which there are multiple learners.

Its goal is to prevent over-fitting and find a better solution.

As we can guess from its name, it utilizes the Bootstrap sampling method, which means that we draw samples ‘with replacement’.

This is another example of over-fitting, where the bottom node is split further just to fit 2 examples.

So, the two examples must be very exceptional in that all the upper nodes failed to uniquely separate them.

Again, the node is now like a look-up table.

What if we don't include the 2 examples when building a tree? Bagging draws samples at random, letting some be drawn multiple times and others not at all. Suppose Tree 1 is built without the two exceptional examples.

Now, we give it the two exceptions for testing, and it lands on a node with many examples.

This no longer works like a look-up table simply because it didn’t even see the examples.

They could still be exceptional in other trees, but the final decision is made by all the ‘n’ trees voting.

In short, random sampling prevents a model from memorizing examples and focuses on a general consensus.
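A tiny sketch of what a single bootstrap draw looks like (the 10-row size is arbitrary):

# Draw a bootstrap sample of row indices: some rows repeat, some never appear
set.seed(42)
n <- 10
boot <- sample(n, n, replace = TRUE)
table(boot)            # how many times each row was drawn
setdiff(1:n, boot)     # rows left out of this tree (its 'out-of-bag' examples)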

Bagging selects features at random as well.

Another characteristic of Bagging is that it randomizes the features too, meaning it randomly selects only some of the features rather than all of them.

For example, suppose we have 100 features and need to predict some classes.

We could use all of them, as a pure Decision Tree does, but instead we choose just some of them at random, say only 30 or 50.

It sounds strange, because it looks like we are not making full use of the features.

Remembering the greedy-algorithm example of finding a shorter route, what if we didn't even have the choice of the bottom route? Because we didn't know about that route, we would have gone with the top route, which results in a shorter trip overall.

By having fewer features, we can avoid a combination of features that leads to a local optimum rather than the global one.

By the nature of random selection, it may also turn out worse.

That is why we repeat the random sampling and the random feature selection many times, so the ensemble can find a good combination biased toward general decision standards.

But this may still not be the true global best, because randomizing does not mean we go through all of the possibilities (finding the optimal tree is an NP-complete problem).

One thing to note is that Bagging is an Ensemble method that can be applied to any model, not just Decision Tree.

When Bagging is applied to Decision Tree, it becomes Random Forest just like trees comprise a forest.
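In the randomForest library this difference is controlled by the mtry parameter, the number of features tried at each split. A quick sketch, again with a placeholder data frame dat and factor target y:

library(randomForest)

p <- ncol(dat) - 1   # number of predictor columns (assumes 'y' is the only target)

# Bagging-like behaviour: every split may consider all p features
bag <- randomForest(y ~ ., data = dat, ntree = 100, mtry = p)

# Random Forest flavour: each split sees only a random subset of the features
rf <- randomForest(y ~ ., data = dat, ntree = 100, mtry = floor(sqrt(p)))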

Demo in R

The Loan dataset is intuitive and easy to understand.

We have 300 examples and 11 features such as Age, Gender, Loan Amount, and Income, with Loan Status as the target.

This is a binary classification task: predicting whether a loan will default or not.

The code uses a minimal set of libraries so it can focus on the simulations.

Also, cross-validation was not used for evaluation, to keep it simple (still fair, because the same training/testing sets were used for all the models).

The full code and the data can be found on Github.

Full Model — Evaluated against Training Set

library(rpart)
library(rpart.plot)

# Classification model with all features
clf <- rpart(formula = Loan_Status ~ ., data = training.set,
             control = rpart.control(minsplit = 1))

# Predict from the training set and calculate accuracy
pred <- predict(clf, newdata = training.set[, -12], type = 'class')
round(sum(training.set[, 12] == pred) / length(pred), 2)

# Plot the tree
rpart.plot(clf, box.palette = 'RdBu', nn = T)

The model was built using all of the features, and it was tested against the training set, not the testing set.

This was intended to see how over-fitting would happen.

*Words on the graph may not be legible, but they are not important at this point.

Please focus on the shape especially the depth and the number of nodes.

Accuracy = 0.96

The tree is so deep and complex. Its accuracy is 0.96, almost perfect, but it was tested against its own training set.

Now, let’s evaluate the model against the testing set (unseen data).

Full Model — Evaluated against Testing Set

# Evaluate against the testing set
pred <- predict(clf, newdata = testing.set[-12], type = 'class')
round(sum(testing.set[, 12] == pred) / length(pred), 2)

Note that the parameter ‘newdata’ is now ‘testing.set’. The result is 0.63, which is much lower than the accuracy against the training set.

This means that the model using all features and all examples didn't generalize well.

One thing we can do here first is to tweak the parameters.

Limiting the depth can be helpful because the model won’t grow that deep, preventing over-fitting.

All accuracy hereafter is calculated from the testing set.

Full Model with Depth Limit

# Set maxdepth to 3
clf <- rpart(formula = Loan_Status ~ ., data = training.set,
             control = rpart.control(minsplit = 1, maxdepth = 3))

# Evaluate against the testing set
pred <- predict(clf, newdata = testing.set[-12], type = 'class')
round(sum(testing.set[, 12] == pred) / length(pred), 2)

Evaluated against the testing set, the accuracy is now 0.68, which is higher than 0.63. Also, the tree is a lot simpler, without even using all of the features.

Accuracy = 0.68

Reduced Model with Random Sampling — Random Forest

library(randomForest)

# Build 100 trees with a maximum of 5 nodes each
clf <- randomForest(x = training.set[-12], y = training.set$Loan_Status,
                    ntree = 100, maxnodes = 5)

We build 100 trees with up to 5 nodes maximum for each.
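The post doesn't show the evaluation line for this model, but following the same pattern as the earlier blocks it would presumably be:

# Evaluate the Random Forest against the testing set
pred <- predict(clf, newdata = testing.set[-12])
round(sum(testing.set[, 12] == pred) / length(pred), 2)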

The accuracy is 0.72, the highest so far.

This means that the model found some way to handle the exceptional examples.

If we want to improve it even more, should we just increase the number of trees? Unfortunately, it might not be too helpful, as Random Forest is not magic.

If we are lucky enough, it could look amazingly better, but only until it sees completely new data.

The improvement is based on a consensus, meaning that the model gives up on minor examples in order to fit the more general ones.

Conclusion

We learnt the fundamentals of Decision Tree, focusing on how splits are made.

This was shown for both categorical and continuous values.

Also, we checked out some characteristics of Decision Tree such as non-linearity, the ability to do regression, and its greedy nature.

To address over-fitting, depth control and Bagging were introduced along with the demo in R.

In the real world, these won't always work, because we shouldn't expect data to come as clean as we wish it were.

However, knowing these properties and parameters is a good starting point for creative experimentation.

Good Reads

Introduction to Data Mining: Chapter 4, Classification
Scikit-Learn Example: Decision Tree Regression
Random Forest Video Guide: Introduction to basic ideas
‘rpart’ R Library: Tutorial and more details
‘randomForest’ R Library: Simple tutorial

You are encouraged to put forward your opinion. Comment here! Or talk to me via LinkedIn if you'd like to quietly point out something wrong with this post… Also, you're welcome to visit my portfolio website!
