Light on Math Machine Learning: Intuitive Guide to Understanding Decision TreesIntroductionThis article aims at introducing decision trees; a popular building block of highly praised models such as xgboost. A decision tree is simply a set of cascading questions. When you get a data point (i.e. set of features and values), you use each attribute (i.e. a value of a given feature of the data point) to answer a question. The answer to each question decides the next question. At the end of this sequence of questions, you will end up with a probability of the data point belonging to each class.Note: This article is behind the Medium paywall. However the code is opensource, and can be accessed from this link.Note 2: This is the fifth blog post in the series of light on math machine learning A-Z. You can find the previous blog posts linked to the letter below.A B C D E F G H I J K L* M N O P Q R S T U V W X Y Z* denote articles behind the Medium paywallWhy this article?Let’s take a step back here! Why should you be reading this article, when there is so many articles explaining decision trees? One point to highlight is that, unlike the articles I found in the internet, this article will not stop at a high level explanation and leave the reader hanging. In contrast, I will try to,Provide a good intuition of the functionality of the model,Do justice to the crux of the model,Explain how overfitting can occur and how to prevent itReinforce the whole explanation with examples.I would also like to allude to an important caveat! The objective of assimilating the nitty-gritties is not to be able to implement a decision tree from scratch (at least not what I expect), but,provide a good understanding of what various parameters (most of them) mean when you use them in practiceHowever the code I provide is a decision tree build from scratch and meant as a guide for you to see the process end to end. What better way is there to get a good understanding, other than actually seen the nuts and bolts!Why decision trees?Decision trees are great for several reasons. But as with any model, the true power comes when you use a model in a suitable context. Remember,garbage in, garbage out!Decision trees are great in the sense thatThey are easily interpretable and follows a similar pattern to human thinking. In other words, you can explain a decision tree as a set of questions/business rules.Prediction is fast. It’s a set of comparison operations until you reach a leaf nodeCan be adapted to deal with missing data without imputing dataPrimer on decision trees:As I said earlier, a decision tree is basically set of cascading questions (formed to a shape of a tree). The questions test whether a feature of the data satisfies a certain condition (e.g. is Outlook == Sunny ?). Let’s flesh this out a little bit. For that let’s assume a typical weather dataset (I personally don’t like the weather dataset — too mainstream, so I’ll switch to something a bit more exciting later).Say we have the above dataset, and would like to predict if we would play golf or not given the weather. The decision tree for this problem might look like the one below.An example decision tree. Round nodes denote decision nodes, where square nodes denote leaf nodesComponents of a decision treeLet me give you a brief anatomy lesson of a decision tree. The tree has decision nodes (round), decisions (edges), and leaf/prediction nodes (square). You first start off with a decision node (e.g. Outlook) and depending on the answer, you might have a leaf node, or another decision node (e.g. Windy). A decision tree could go on for an arbitrary number of decision nodes. However, each branch should end with a leaf node.Predicting with a decision treeNow say you have a new datapoint, which you would like to predict the label for.Outlook = Sunny, Temperature = Hot, Humidity = Normal, Windy = TrueIn order to predict, you start from the top of the tree and traverse deeper and deeper using the attributes (i.e. feature values) to make decisions along the tree, until you reach a leaf node. This process is depicted below.Guess you’re not playing Golf.Building a treeOf course there is a more important question that still remains.How can we arrive at this tree?More importantly,How can we decide the best feature partition data at a given depth?There are several different different algorithm used to generate trees such as,CART (Classification and regression trees) — Uses Gini coefficient to decide partition featureID3 (Uses information gain – discussed later, to decide the partition feature, and not designed to deal with continuous features)C4.5 (Works similar to ID3 by using information gain to split data. However C4.5 can handle continuous features, as well as can work with missing data)Let’s now move onto solving a more exciting fictional problem. We want to find an evil King who is terrifying streets of TreeLand.Find the disguised king on the streets(Data and features)From this section onward, I am going to explain decision trees with a story of the king of TreeLand who loved fooling locals by disguising himself on the streets. This was annoying because whenever someone unintentionally treat the king badly, he would be punished. People were quite vexed, and one wise man (let’s call him John) stepped up and said he can solve this in 10 days!John assumed several characteristics of the king that cuts a distinctive figure when he is in public. The features are,Comes out of the castleWalks slowlyEats 5 or more times dailyHas a gold toothJohn started observing people and creating a dataset (one datapoint per person) along with the target variable (i.e. whether the person was actually the king or not). He has the following observations. Unfortunately this dataset came at the cost of suffering of five people.With this dataset, John did the following. He wrote down each feature against the target variable, and highlighted each entry according to the following logic:Green — If the attribute matched the corresponding labelRed — If the attribute did not match the corresponding labelJohn’s whiteboardNow, looking at the above tables, we can get a sense of which features contribute the most to correctly predicting the King. For example, the following diagram summarises the observations from the above tables.Summary of the observations from the highlighted tablesWell instincts is not good enough to call you solved a problem. You need quantifiable metrics to show that these features are powerful. In fact, half of the citizens in TreeLand are machine learning engineers! This is where the information gain comes in.Information gain: How much more information gained by splitting data?We already discussed that the basic idea behind a tree is to ask a question (i.e. test data against an attribute) and split the data depending on the answer (e.g. True or False). Information gain measures how predictable the data became (or how much more information gained) by splitting the data according to the answer provided at the decision node. Put on your science goggles, time to formalise. Information gained using feature F on data Y is defined by:IG(Y, F) = H(Y)-H(Y|F)where Y is the target variable and F is some feature (e.g. Gold Tooth). So the more powerful a feature is, the more information is gained by splitting the data on that feature. H(Y) is the entropy of the Y and H(Y|F) is the conditional entropy of Y conditioned on F (discussed soon). In code, computing information gain looks as follows.Let us now look at what entropy and conditional entropy is.Entropy and Conditional EntropyEntropy measures how many bits are required to transform data. So the more predictable your data, the lesser the number of bits required. In contrast, the less predictable your data, the more bits you need. This is written down as follows:where (y in Y) denotes the unique outcomes in Y (e.g. Is King = True or False in our example). You can implement this in Python quite easily.Entropy in layman’s termsLet’s understand entropy more intuitively. Say the King decided to never fool the locals in the first place, and John decided to collect data to solve this problem. He would have had an entropy of zero. Why? Because the label is 100% predictable (i.e. it is always 0). But, in our data, the entropy is at its highest (i.e. 1), because there is a equal chance of a person being a king or not.Imagine two people (John and Lisa) talking over the phone. Every time John calls Lisa, Lisa has to say if it was the king or not for each datapoint (i.e. 10 times) without any additional information. Below I depict how the conversation will go down for different entropy values.Difference of different entropy valuesConditional entropy in layman’s termsThis sets up the ground nicely to explain what conditional entropy is (i.e. H(Y|F). Imagine that now, for each datapoint, John tells whether the person had a Gold Tooth or not. Then, depending on the feature value (i.e. F), Lisa has to predict if that was the king or not. If you look at the table below, the data is now more predictable than not having any features.From Jonh’s whiteboardIn summary, what information gain does is that,it measure the difference in predictability before and after providing information about the feature F.To remind ourselves again, here’s the formula for the information gain.IG(Y, F) = H(Y)-H(Y|F)Meanwhile in TreeLand …(Building the tree)John computes the information gain for both Castle and Gold Tooth and find out that they are much better than Slow and Greedy. Then John first picks the feature “Castle”.Tree with the feature “Castle”He’s happy with the left side. He thinks, “4/6 classification is not bad”. Given that he only has 10 datapoints, we’ll cut him some slack. But the results from the right side are poor (i.e. it’s 50% True or 50% False). He needs to improve the results, so he goes ahead and try to split the remainder on the left using the feature that gives the highest information gain, which turns out to be the “Gold Tooth” feature, and obtains the following.Tree with both “Castle” and “Gold Tooth” featuresSo there you go! John has found the secret sauce! This is in fact how the ID3 algorithm works. More formally you would have the following pseudocode.PeseudocodeThe most important function in this code would be build_tree which in reality would look like this (available in the code).Treeland citizens can relax again …So after John showed what he came up with, people were impressed. But John knew that more data would not only improve his model, but also validate his findings. So he asked others to help him collect more data. But little did he know, he’ll run into problems doing so.The tree is grew too much …(Overfitting and regularisation)So John managed to get several people onboard with his idea to augment the dataset and improve the model. They managed to get the dataset up to 50 datapoints. So after collecting data, dreaming about a better more robust model, John created a new tree. Here’s the tree he got.New decision treeThis tree instantly raised red flags in John’s mind. The tree is way too complicated to solve this problem. It should be enough to have only two or three features in a simple tree structure. So what happened here? This is called overfitting! John did not use any regularisation to control the complexity of the tree, leading to yielding a non-generalising and uninterpretable tree model that defeats the purpose.Let’s pivot back and correct John’s mistakesThere are several ways you can combat overfitting in tree models. I will be discussing three methods in this section. They are:Making sure each leaf node has at least n datapointsMaking sure the depth of the tree does not exceed dMaking sure the information gain is greater than a threshold to make a splitMaking sure that each leaf node has at least n number of datapoints in a leaf node is important. Because, you can build the world’s perfect tree (heavily overfitted) to any training set, by having a single datapoint assigned to each leaf node in the tree. Therefore, making sure there is at least n number of datapoints (e.g. say 5% of the data) leads to more generalised tree models.Making sure that the tree does not exponentially grow, will again guard us against overfitting. In this method, we are basically limiting the number of questions we can ask to predict the class of a model, which leads to choosing the best few questions to ask, rather than asking all possible questions in the world.The last one, making sure that the information gain is above a threshold should make sense. Because, what’s the point of growing the tree, if we’re not getting a benefit from it?So, after revising these techniques and applying them to the model, John came up with the following model, which is more explainable.Regularised treeI am not going to reiterate about predicting with trees, as the process is quite straight forward. This can be implemented in Python using recursive functions as below.With this we conclude the adventure in TreeLand. In conclusion, John collected data, built an initial model using information gain to decide valuable features. Ran into trouble with overfitting and solved it using regularisation. The good news is that, they have learnt to avoid king treating the wrong way using this model John developed. So they lived happily ever after!Working with continuous valuesSo far, we limited ourselves to only looking at discrete variables. But there’s so much you can do only with discrete variables. The interesting problems usually have a mix of both continuous and discrete variables. So how can we use continuous variables in decision trees? There are two popular ways.Discretize the continuous feature by binning the range of a continuous feature to equal binsUse “minimal entropy partitioning”[2] to find a split in the continuous featureMinimal entropy partitioningThe minimal entropy partitioning (MEP) works as follows.You first nominate a set of cut points depending on the range the data spreads within (e.g. Cut the range at T locations equally spread)Then for each cut point, compute the weighted sum of entropy of the label YPick the cut point, that gives the minimum entropyWrap upIn this article we discussed decision trees in great detail. Decision trees is a popular machine learning model, because they are more interpretable (e.g. compared to a neural network) and usually gives good performance, especially when used with ensembling (bagging and boosting).We first briefly discussed the functionality of a decision tree while using a toy weather dataset as an example. Then we jumped to a more lively problem, which was to identify if the annoying King of TreeLand is terrifying the streets. We saw how John used information gain to identify features that can help to identify the King and subsequently built a tree using those features. Afterwards, we saw how problems such as “overfitting” arose as the dataset grew larger. We also discussed how overfitting can be prevented using various regularisation methods (e.g. limiting the maximum depth of the tree). Finally we concluded the adventure in TreeLand and did a quick side tour to see how continuous features can be handled in decision trees.The code for this article is available here.Reference[1] Lecture notes from CMU[2] Multi interval discretization of continuous valued attributes for classification learningVideo from Google Developers Youtube ChannelVideo from Stanford