Understanding Decision Trees (once and for all!) ????

????This plot on the left is the same as the previous one but with the first decision boundary of the tree : petal width = 0.

8 cm.

How was this decision boundary decided ?A decision boundary is decided by testing all the possible decision boundaries splitting the dataset and choosing the one that minimizes the Gini impurity of the two splits.

What is Gini impurity ?Gini impurity is a metric that measures the probability from a randomly chosen element (here an iris) to be incorrectly classified, i.

e.

the probability of choosing an element times the probability of being misclassified.

If we sum over all J possible classes we have the Gini impurity :The last expression is the one we are going to use to perform the Gini test.

Let’s compute the Gini impurity for the first nodeAt the root node all the data points are mixed.

Using the result above Gini impurity is :This gives us :We can verify this number by checking the Gini information on the root node : ‘gini = 0.

664’.

For the first node we have a Gini impurity of 0.

664.

Let’s get back to the first decision boundaryThe question to be asked to determine a decision boundary is : how to split the iris species so that we create more homogeneous groups ?Intuitively what we can observe on the graph above is that we can create a homogeneous group containing only setosa species just by splitting the dataset along the petal width axis.

But the algorithm has no intuition.

So how does it find the best split ?It will try all the possible boundaries along all the features, i.

e.

all the axes petal width and sepal width.

For each split the algorithm will compute the Gini impurity of the two groups created.

Finally it will choose the decision boundary that gives the lowest Gini impurity for the two groups (either summing the Gini impurity for each group or doing a mean).

Let’s get back to the first node and the first splitIn the case of the root node, the algorithm has found that among all the possible splits the split with petal width = 0.

8 cmgives the lowest Gini impurity.

The Gini impurity for the left leaf is :We verify this result with the tree graph.

This result is not surprising because in the left leaf which matches the left part of the graph we only have setosa iris, so the group is very homogeneous and Gini impurity is a measure of homogeneity.

The Gini impurity for the right leaf is :We find the same result as the one shown in the tree graph.

Moreover this Gini impurity is close to 0.

5 because there are almost as much virginica as versicolor irises.

Node 1The process described will continue iteratively until the tree succeeds or tries to separate all the data points or a restrictive condition is applied to the algorithm like a limitation in the depth of the tree.

As the Gini impurity is 0 for petal width <= 0.

8 cm, i.

e.

we cannot have a more homogeneous group, the algorithm will not try to split this part anymore and will focus on the right part of the tree.

Intuitively, the decision tree continues to use the petal width feature to split the right part in two.

Indeed, it seems easier to create homogeneous groups using petal width instead of sepal width.

Splitting at petal width <= 0.

8 cm creates a group with only virginica irises (so with a Gini impurity of 0).

Managing to create a group with uniquely one species is not always the best option though, as we will see for the next split…As the algorithm has created a node with only virginica, this node will never be split again and it will be a leaf.

Node 2For this node the algorithm chose to split the tree at petal width = 1.

55 cm creating two heterogeneous groups.

Intuitively we would have split at petal width = 1.

3 cm or sepal width = 3.

1 cm to create a group with only versicolor irises.

Indeed this would have created a node with a Gini impurity at 0.

But in fact the other node created is more heterogeneous, so much so that the Gini impurity of this node is bigger than the Gini impurity of the sum of the two nodes created with the other split.

Let’s verify this :Gini impurity with the split at petal width = 1.

55 cmLeft node :We verify this result on the tree.

Right node :Again we verify this result on the tree.

The Gini impurity for this split is :The Gini index of a split is ponderated by the number of points for each group.

Gini impurity with the split at petal width = 1.

3 cmLeft node :Right node :The Gini impurity for this split is :The algorithm is right and our intuition was wrong.

Indeed the first split produces the lowest Gini impurity so this split is preferable.

Reminder : The algorithm tries each possible split for each feature.

Node 3The first thing to notice is that the previous split has not changed the decision function of the tree below and above the split petal width = 1.

55 cm.

Indeed for both nodes created, the versicolor still holds the majority.

For this level, the decision tree finally uses the sepal width feature.

The two splits created are petal width = 2.

65 cm (in the subdivision 0.

8 cm < petal width <= 1.

55 cm ) and petal width = 3.

15 cm (in the subdivision 1.

55 cm < petal width <= 1.

75 cm).

Nodes 4, 5, 6Applying the same principle again and again the algorithm will try to isolate every point until it has only homogeneous groups.

This can lead to overfitting if we don’t limit the size of the tree for example.

(The tree is learning by heart the training set instead of understanding it which will prevent him from making good predictions on the testing set).

On the graph above we can see the decision boundaries being decided for the tree at depth 4, 5, 6.

The depth 6 is the depth of leaves and ends the building of the tree.

How does the built tree take a decision ?On the plot above we can see the training data (represented by o) on which the Decision Tree has been trained (and has overfitted) and the the testing data (represented by x).

When the Decision Tree has to predict a target, an iris species, for an iris belonging to the testing set, it travels down the tree from the root node until it reaches a leaf, deciding to go to the left or the right child node by testing the feature value of the iris being tested against the parent node condition.

For example, at the root node if the tested iris petal width <= 0.

8 cm it goes to the left node which is a leaf and classifies the iris as a setosa iris.

Otherwise, it goes to the right node and continues the same process until reaching a leaf.

As we have seen with the confusion matrix, two versicolor have been misclassified for virginica :in the region 1.

75 cm < petal width : we can see the yellow cross, i.

e.

it is a versicolor iris, on the blue background, i.

e.

the Decision Tree classifies it as a virginica.

in the region 1.

55 cm < petal width <= 1.

75 cm and sepal width <= 2.

75 cm: it is the same reasoning as above.

The rest of the testing irises have been well classified which gives us an accuracy of 0.

93.

During the testing phase, the algorithm takes every point and travels across the decision tree choosing the left or right node according to the feature value of the iris being tested.

ConclusionIn this article, we dissected Decision Trees to understand every concept behind the building of this algorithm that is a must know.

????To understand how a Decision Tree is built, we took a concrete example : the iris dataset made up of continuous features and a categorical target.

Decision Trees can also be built using categorical features (it is even simpler because one branch is one category) or a continuous target (here it may be a bit more complex because it does not use Gini impurity to measure the homogeneity but a variance metric…).

This could be the subject of another article….