A Guide to Decision Trees for Machine Learning and Data Science

George Seif · Nov 30

Decision Trees are a class of very powerful Machine Learning models capable of achieving high accuracy on many tasks while being highly interpretable. The "knowledge" learned by a decision tree through training is formulated directly into a hierarchical structure: the tree models a set of sequential, hierarchical decisions that ultimately lead to some final result. The decisions are selected such that the tree is as small as possible while aiming for high classification / regression accuracy.

Decision Trees in Machine Learning

Decision Tree models are created in 2 steps: induction and pruning. Induction is where we actually build the tree, i.e. where we set all of the hierarchical decision boundaries based on our data. Pruning is the process of removing unnecessary structure from a decision tree, effectively reducing its complexity to combat overfitting, with the added bonus of making the tree even easier to interpret.

Induction

From a high level, decision tree induction goes through 4 main steps to build the tree:

1. Begin with your training dataset, which should have some feature variables and a classification or regression output.
2. Determine the "best feature" in the dataset to split the data on; more on how we define "best feature" below.
3. Split the data into subsets that contain the possible values for this best feature. This splitting basically defines a node on the tree, i.e. each node is a splitting point based on a certain feature from our data.
4. Recursively generate new tree nodes by using the subsets of data created in step 3.

Determining the "best feature" in step 2 comes down to finding the split that minimises some cost function. For classification, we use the Gini index function:

G = Σ_k p_k (1 - p_k)

where p_k is the proportion of training instances of class k in a particular prediction node. A node containing only instances of a single class has G = 0, i.e. it is perfectly pure. If a split leaves every class evenly mixed across its output nodes, it has told us very little. On the other hand, if our split puts a high percentage of a single class into each output node, then we have gained the information that splitting in that particular way on that particular feature variable gives us a particular output!
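To make this concrete, here is a minimal NumPy sketch of the Gini index and of the exhaustive best-split search from step 2. The gini and best_split helpers are names chosen purely for illustration; this is a sketch of the idea under those assumptions, not the implementation used by any particular library.

import numpy as np

def gini(labels):
    # Gini index: G = sum_k p_k * (1 - p_k), where p_k is the fraction
    # of samples in this node belonging to class k.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def best_split(X, y):
    # Exhaustively try every feature and threshold, and keep the split
    # whose children have the lowest weighted average Gini index.
    best_feature, best_threshold, best_score = None, None, np.inf
    n_samples, n_features = X.shape
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold, best_score

Calling best_split on the data reaching a node returns the feature index and threshold that induction would split that node on; recursing on the two resulting subsets is exactly step 4 above.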
Now we could of course keep splitting and splitting until our tree has thousands of branches, but that is usually a bad idea: an extremely deep tree is almost guaranteed to overfit the training data. This is exactly why we rely on pruning and on stopping criteria such as a minimum number of training instances per node or a maximum tree depth.

It is also well worth visualising what the tree has learned. We can colour the nodes based on the feature names and display the class and feature information of each node; the second sketch below shows one way this might look. There are several parameters that you can set for your decision tree model in Scikit-Learn too, such as max_depth and min_samples_split; a few of them appear in the first sketch below. One practical tip: if we sort our data on each feature beforehand, our training algorithm will have a much easier time finding good values to split on.
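To show what those parameters look like in practice, here is a minimal Scikit-Learn training sketch; the iris dataset and the specific parameter values are purely illustrative choices:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# criterion picks the splitting cost function, while max_depth,
# min_samples_split and min_samples_leaf all limit how far induction
# is allowed to grow the tree, which combats overfitting.
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=3,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=0,
)
clf.fit(iris.data, iris.target)
print("training accuracy:", clf.score(iris.data, iris.target))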
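And here is one way the coloured visualisation described above might be produced, using Scikit-Learn's plot_tree. This assumes matplotlib is installed and reuses clf and iris from the previous sketch:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# filled=True colours each node by its majority class, and every node is
# annotated with its split feature, threshold, Gini value, sample count
# and class distribution.
plt.figure(figsize=(12, 8))
plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)
plt.show()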
Tips for Practically Applying Decision Trees

Here are a few of the pros and cons of decision trees that can help you decide whether or not it's the right model for your problem, as well as some tips on how to apply them effectively:

Pros

- Easy to understand and interpret: at each node, we can see exactly which decision the model is making.
- Fast at inference time: predicting only requires walking one root-to-leaf path, so the cost grows logarithmically with the amount of training data. That's a huge plus, since it means that having more data won't necessarily make a huge dent in our inference speed.

Cons

- Overfitting is quite common with decision trees, simply due to the nature of their training. It's often recommended to perform some type of dimensionality reduction, such as PCA, so that the tree doesn't have to learn splits on so many features.
- For similar reasons as in the case of overfitting, decision trees are also vulnerable to becoming biased towards the classes that have a majority in the dataset; some form of class balancing, such as class weights or resampling, helps here.
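To sketch both mitigations in code: PCA can be chained in front of the tree, and Scikit-Learn's class_weight option reweights classes inversely to their frequency. The component count is illustrative, and X_train / y_train are hypothetical placeholders for your own data:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# PCA shrinks the number of features the tree has to consider for splits,
# and class_weight="balanced" reweights training instances inversely to
# their class frequency to counter majority-class bias.
model = make_pipeline(
    PCA(n_components=10),           # illustrative component count
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
)
# model.fit(X_train, y_train)       # X_train / y_train are placeholders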