Decision Tree Regressor explained in depth

Georgios Drakos · May 22

The Decision Tree algorithm has become one of the most widely used machine learning algorithms, both in competitions such as Kaggle and in business environments.

Decision Trees can be used for both classification and regression problems.

This article presents the Decision Tree regression algorithm along with some advanced topics.

Table of Contents

Introduction
Important Terminology
How does it work?
How is Splitting Decided for Decision Trees?
When not to use?
Advantages & Disadvantages
Techniques to avoid over-fitting
Decision Trees Vs Random Forests
Hyperparameters of Sklearn Decision Tree
Tree-based models Vs Linear models
Visualize a Decision Tree model
Conclusion
References

Introduction

Decision Trees can be summarized with the bullet points below:

Decision trees are predictive models that use a set of binary rules to calculate a target value.

Each individual tree is a fairly simple model that has branches, nodes and leaves.

Important Terminology

Before diving in, let's look at the basic terminology used with decision trees:

Root Node: It represents the entire population or sample, and it further gets divided into two or more homogeneous sets.

Splitting: It is the process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.

Pruning: When we remove sub-nodes of a decision node, this process is called pruning.

You can think of it as the opposite of splitting.

Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.

Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are its children.

How does it work?

A decision tree arrives at an estimate by asking a series of questions of the data, each question narrowing the possible values until the model is confident enough to make a single prediction.

The order of the questions as well as their content is determined by the model.

In addition, the questions asked are all in a True/False form.

This is a little tough to grasp because it is not how humans naturally think, and perhaps the best way to show this difference is to look at a real decision tree built from data.

In the above problem, x1 and x2 are two features that allow us to make predictions for the target variable y by asking True/False questions.

For each True and False answer, there are separate branches.

No matter the answers to the questions, we eventually reach a prediction (leaf node).

So, given any pair of values X1, X2, start at the root node at the top and progress through the tree, answering the questions along the way.

One aspect of the decision tree I should mention is how it actually learns (how the ‘questions’ are formed and how the thresholds are set).

As a supervised machine learning model, a decision tree learns to map data to outputs in what is called the training phase of model building.

During training, the model is fitted with historical data relevant to the problem domain, together with the true values we want the model to learn to predict.

The model learns any relationships between the data and the target variable.

After the training phase, the decision tree produces a tree similar to the one shown above, calculating the best questions as well as their order to ask in order to make the most accurate estimates possible.

When we want to make a prediction, the same data format should be provided to the model.

The prediction will be an estimate based on the training data the model has been trained on.
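To make this concrete, here is a minimal sketch of the training and prediction phases using scikit-learn's DecisionTreeRegressor; the feature and target values below are made up purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# training phase: historical data (two features per row) and the true target values
X_train = np.array([[1.0, 3.0], [2.0, 1.0], [3.0, 2.0], [4.0, 5.0]])
y_train = np.array([1.5, 2.0, 2.5, 4.0])
model = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)

# prediction phase: new data must follow the same format (two features per row)
print(model.predict(np.array([[2.5, 2.0]])))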

How is Splitting Decided for Decision Trees?

The decision of where to make strategic splits heavily affects a tree's accuracy.

The decision criteria are different for classification and regression trees.

Decision tree regressors normally use the mean squared error (MSE) to decide how to split a node into two or more sub-nodes.

Suppose we are growing a binary tree: the algorithm will first pick a value and split the data into two subsets.

For each subset, it will calculate the MSE separately.

The tree chooses the value that results in the smallest MSE.

Let's examine how splitting is decided for a Decision Tree regressor in more detail.

The first step to create a tree is to create the first binary decision.

How are you going to do it? We need to pick a variable and a value to split on such that the two groups are as different from each other as possible.

For each variable, and for each possible value of that variable, check whether the resulting split is better.

How do we determine whether it is better? Take the weighted average of the MSE of the two new nodes (mse * num_samples for each node). To sum up, we now have:

A single number that represents how good a split is, namely the weighted average of the mean squared errors of the two groups that the split creates.

A way to find the best split, which is to try every variable and every possible value of that variable and see which variable and which value gives us the split with the best score (a code sketch of this search is shown below).

This is the entirety of creating a decision tree regressor; it stops when some stopping condition (defined by hyperparameters) is met:

When you hit a limit that was requested (for example max_depth).

When your leaf nodes only have one sample in them (no further split is possible; the MSE on the training set will be zero, but the tree will overfit any other set, so it is not a useful model).

Are there circumstances when it is better to split into 3 groups? It is never necessary to do more than one split at a level, because you can always split the resulting groups again.
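To make the greedy search concrete, here is a minimal sketch of the split search described above; it is not scikit-learn's implementation, and the helper names weighted_mse and find_best_split are made up for this example:

import numpy as np

def weighted_mse(y_left, y_right):
    # Score of a split: weighted average of the MSE of the two groups,
    # i.e. sum of (mse * num_samples) divided by the total sample count.
    mse_left = ((y_left - y_left.mean()) ** 2).mean()
    mse_right = ((y_right - y_right.mean()) ** 2).mean()
    n = len(y_left) + len(y_right)
    return (mse_left * len(y_left) + mse_right * len(y_right)) / n

def find_best_split(X, y):
    # Try every variable and every possible value of that variable,
    # and keep the (feature, threshold) pair with the lowest score.
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j]):
            mask = X[:, j] <= threshold
            if mask.all() or not mask.any():
                continue  # skip degenerate splits that leave one side empty
            score = weighted_mse(y[mask], y[~mask])
            if score < best_score:
                best_feature, best_threshold, best_score = j, threshold, score
    return best_feature, best_threshold, best_score

# toy usage: the best split should be found on feature 0
X = np.random.rand(200, 2)
y = 3 * X[:, 0] + np.random.normal(0, 0.1, 200)
print(find_best_split(X, y))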

When not to use?

As we saw previously, a decision tree will be unable to make accurate predictions if the data provided at prediction time are unrelated to the data it has been trained on.

Let's demonstrate that argument with an example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

x = np.linspace(0, 1)
y = x + np.random.uniform(-0.2, 0.2, x.shape)
plt.scatter(x, y)

We're now going to build a decision tree model and act as if this were a time series problem.

So I'm going to take the left part as the training set and the right part as the validation set.

# train, validation set split
x_trn, x_val = x[:40, None], x[40:, None]
y_trn, y_val = y[:40, None], y[40:, None]

# fit a model
m = DecisionTreeRegressor(max_depth=6).fit(x_trn, y_trn)

Please note that DecisionTreeRegressor expects a 2D array (an array of rank 2), not a 1D array; indexing with None is what turns the 1D array into a 2D array.

Let's now plot the result:

plt.scatter(x_val, m.predict(x_val), color='blue', label='Prediction')
plt.scatter(x_val, y_val, color='red', label='Actual')
plt.scatter(x_trn, m.predict(x_trn), color='blue')
plt.scatter(x_trn, y_trn, color='red')
plt.legend(loc='upper left')

If you don't know how decision trees work, the result may be surprising: the predictions for the validation set are flat, which renders the model useless there.

In other words, decision trees and tree-based models in general are unable to extrapolate to data they haven't seen before, particularly future time periods, as they just average data points they have already seen.

In a nutshell, decision trees and tree-based models in general just perform a clever form of nearest neighbours.

For this kind of problem, where any tree-based algorithm is useless, a neural network or a linear regression model is preferred.

The reason is that you want to use a model with a functional form or shape that can actually fit the trend, so that it can extrapolate nicely.

One major disadvantage of Decision Trees is that they are prone to overfitting.

That's why they are rarely used on their own; instead, other tree-based models such as Random Forest and XGBoost are preferred.

To sum up, decision trees and tree-based methods in general are a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems.

As noted above, they are suitable for regression problems only when the target variable lies inside the range of values seen in the training dataset.

Advantages & Disadvantages of Decision Trees

Advantages

It can be used for both classification and regression problems.

Easy to understand, interpret, and visualise.

Useful in data exploration: a decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables.

With the help of decision trees, we can create new variables/features that are better able to predict the target variable.

You can refer to the article "Trick to enhance power of regression model" for one such trick.

It can also be used in the data exploration stage.

Less data preparation required: it is fairly robust to outliers and missing values.

Data type is not a constraint: It can handle both numerical and categorical variables.

Non-Parametric Method: Decision tree is considered to be a non-parametric method.

This means that decision trees have no assumptions about space distribution and the classifier structure.

Can capture nonlinear relationships.

Disadvantages

Overfitting: overfitting is one of the most practical difficulties for decision tree models.

This problem gets solved by setting constraints on model parameters and pruning (discussed in detail below).

Not ideal for continuous variables: while working with continuous numerical variables, a decision tree loses information when it discretizes them into categories.

Cannot extrapolate.

Decision trees can be unstable: Small variations in the data might result in a completely different tree being generated.

This is called variance, which needs to be lowered by methods like bagging and boosting.

No guarantee of returning the globally optimal decision tree.

This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.

Techniques to avoid over-fitting

Often you may find that you have overfitted your model to the data, which is detrimental to the model's performance when you introduce new data.

If no limit is set on a decision tree, it will give you zero MSE on the training set because, in the worst case, it will end up making one leaf for each observation.
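To see this claim in action, here is a small sketch on made-up toy data comparing the training MSE of an unconstrained tree with a depth-limited one:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = X.ravel() + rng.normal(0, 0.2, 100)

full = DecisionTreeRegressor().fit(X, y)             # no limits: one leaf per observation
shallow = DecisionTreeRegressor(max_depth=3).fit(X, y)

print(mean_squared_error(y, full.predict(X)))        # ~0 on the training set (overfitted)
print(mean_squared_error(y, shallow.predict(X)))     # > 0, but generalizes better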

Thus, preventing overfitting is of major importance when training a decision tree, and it can be done in three ways:

Setting constraints on tree size (fine-tuning hyperparameters)
Tree pruning
Using a Random Forest

Let's discuss each of these briefly.

Setting Constraints on Tree Size

This can be done by fine-tuning some of the hyperparameters that are used to define a tree.

First, let's look at the general structure of a decision tree. Understanding the role of the parameters used in tree modelling will help you fine-tune a decision tree in both R and Python.

Minimum samples for a node split

Defines the minimum number of samples (or observations) required in a node for it to be considered for splitting.

Used to control over-fitting.

Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

Values that are too high can lead to under-fitting; hence, this parameter should be fine-tuned with care.

Minimum samples for a terminal node (leaf)

Defines the minimum samples (or observations) required in a terminal node or leaf.

Used to control over-fitting, similar to the minimum samples for a node split. Values that are too high can lead to under-fitting; hence, it should be fine-tuned with care.

Maximum depth of the tree (vertical depth)

Used to control over-fitting, as a higher depth allows the model to learn relations very specific to a particular sample.

Values that are too high can lead to over-fitting; hence, it should be fine-tuned with care.

Maximum features to consider for a split

The number of features to consider while searching for the best split.

These will be randomly selected.

As a rule of thumb, the square root of the total number of features works well, but we should check values up to 30–40% of the total number of features.

Higher values can lead to over-fitting, but this depends on the case.
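For illustration, here is a hedged sketch of how these constraints could be set with scikit-learn's DecisionTreeRegressor; the specific values are arbitrary examples, not recommendations:

from sklearn.tree import DecisionTreeRegressor

constrained_tree = DecisionTreeRegressor(
    max_depth=5,            # maximum vertical depth of the tree
    min_samples_split=20,   # minimum samples a node needs to be considered for splitting
    min_samples_leaf=10,    # minimum samples required in a terminal node (leaf)
    max_features='sqrt',    # number of features considered when searching for a split
    random_state=0,
)
# constrained_tree.fit(X_train, y_train)  # assumes X_train, y_train are available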

Tree Pruning

As discussed earlier, the technique of setting constraints is a greedy approach.

In other words, it checks for the best split at each step and moves forward until one of the specified stopping conditions is reached.

One of the questions that arise in a decision tree algorithm is the optimal size of the final tree.

A tree that is too large risks overfitting the training data and poorly generalizing to new samples.

A small tree might not capture important structural information about the sample space.

However, it is hard to tell when a tree algorithm should stop because it is impossible to tell if the addition of a single extra node will dramatically decrease error.

This problem is known as the horizon effect.

A common strategy is to grow the tree until each node contains a small number of instances, and then use pruning to remove nodes that do not provide additional information.

Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set.

There are many techniques for tree pruning that differ in the measurement that is used to optimize performance.

One of the simplest forms of pruning is reduced error pruning.

Starting at the leaves, each node is replaced with its mean value.

If the MSE is not affected then the change is kept.

While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
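Reduced error pruning is not available in scikit-learn; what it does offer is minimal cost-complexity pruning through the ccp_alpha parameter. Here is a hedged sketch, on made-up toy data, of choosing the pruning strength with cross-validation:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Grow a full tree and list the candidate pruning strengths (alphas).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Keep the alpha with the best cross-validated score.
scores = [cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)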

Use a Random Forest

It seems that not many people actually take the time to prune or fine-tune a single decision tree; instead, they use a random forest regressor (a collection of decision trees), which is less prone to overfitting and usually performs better than a single optimised tree.

Decision Trees Vs Random Forests

The common argument for using a decision tree over a random forest is that decision trees are easier to interpret: you simply look at the decision tree's logic.

However, in a random forest, you’re not going to want to study the decision tree logic of 500 different trees.

At the same time, Random Forest is actually a collection of Decision Trees which makes the algorithm slow and ineffective for real-time predictions.

In general, RF can be fast to train, but quite slow to create predictions once they are trained.

This is because it has to run predictions on each individual tree and then average their predictions to create the final prediction.

(A solution to that can be to use a cluster, running different trees at the same time.) A more accurate prediction requires more trees, which results in a slower model.

In most real-world applications the random forest algorithm is fast enough, but there can certainly be situations where run-time performance is important and other approaches would be preferred.

Finally, Decision Trees do suffer from overfitting while Random Forest can prevent overfitting resulting in better prediction most of the time.

This is a significant advantage of the Random Forest regression model, a fact which makes it more appealing to a lot of data scientists.
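As a rough sketch (reusing the x_trn, y_trn and x_val arrays from the earlier time-series example; the settings here are arbitrary), swapping a single tree for a forest is essentially a one-line change:

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=6).fit(x_trn, y_trn.ravel())
forest = RandomForestRegressor(n_estimators=100, max_depth=6).fit(x_trn, y_trn.ravel())

# The forest averages the predictions of its 100 trees, which lowers variance
# but makes each prediction roughly 100 times more work than a single tree.
print(tree.predict(x_val[:3]))
print(forest.predict(x_val[:3]))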

Hyperparameters of Sklearn Decision Tree

min_samples_leaf : int, float, optional (default=1)

This is the minimum number of samples for a terminal node that we discussed above.

If int, then consider min_samples_leaf as the minimum number.

If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

min_samples_split : int, float, optional (default=2)

This is the minimum number of samples for a node split that we discussed above.

If int, then consider min_samples_split as the minimum number.

If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

max_features : int, float, string or None, optional (default=None)

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.

If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.

If “auto”, then max_features=n_features (for the regressor this behaves like None; it is the classifier variant that uses sqrt(n_features)).

If “sqrt”, then max_features=sqrt(n_features).

If “log2”, then max_features=log2(n_features).

If None, then max_features=n_features.

max_depth : integer or None, optional (default=None)

The maximum depth of the tree.

If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
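For instance, under the int vs float semantics described above, here is a minimal sketch (with hypothetical values) of equivalent settings for a dataset of 1,000 samples:

from sklearn.tree import DecisionTreeRegressor

# int form: absolute sample counts
dt_counts = DecisionTreeRegressor(min_samples_split=20, min_samples_leaf=10, max_depth=5)

# float form: fractions of n_samples; with n_samples = 1000,
# min_samples_split=0.02 -> ceil(0.02 * 1000) = 20 samples,
# min_samples_leaf=0.01  -> ceil(0.01 * 1000) = 10 samples.
dt_fractions = DecisionTreeRegressor(min_samples_split=0.02, min_samples_leaf=0.01, max_depth=5)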

Tree-based models Vs Linear models

Actually, you can use either type of model.

It is dependent on the type of problem you are solving.

Let's look at some key factors that will help you decide which algorithm to use:

If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.

If there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method.

If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model.

Decision tree models are even simpler to interpret than linear regression!

Let's prove our argument on points 1 and 2 with an example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

plt.figure(figsize=(20, 10))
x = np.linspace(0, 2, 10000)
y = 1 + 3*x
plt.scatter(x, y)

We're now going to build a decision tree and a linear model.

We expect the linear model to fit perfectly as the target variable is linearly related to feature x.

So I'm going to shuffle the data and consider 70% as the training set and 30% as the validation set.

idxs_train = sorted(np.random.permutation(len(x))[:int(0.7*len(x))])
idx_test = [i for i in range(0, len(x)) if i not in idxs_train]

# train, validation set split
x_trn, x_val = x[idxs_train, None], x[idx_test, None]
y_trn, y_val = y[idxs_train, None], y[idx_test, None]

# fit a model
dt = DecisionTreeRegressor(max_depth=3).fit(x_trn, y_trn)
l = LinearRegression().fit(x_trn, y_trn)

Please note that both models expect a 2D array (an array of rank 2), not a 1D array; indexing with None turns the 1D array into a 2D array.

Let's now plot the result:

plt.figure(figsize=(20, 10))
plt.scatter(x, dt.predict(x[:, None]), color='blue', label='Prediction DT')
plt.scatter(x, l.predict(x[:, None]), color='green', label='Prediction Linear')
plt.scatter(x, y, color='red', label='Actual')
plt.legend(loc='upper left')

In a way, the decision tree tries to fit a staircase.

Obviously, in this case, the linear model works nicely, but in other cases, where the target variable is not linearly correlated with the input features, a decision tree will be a better option as it can capture nonlinearities.

You can also find another example in the scikit-learn documentation.

Visualize a Decision Tree model

Visualizing a Decision Tree is fairly simple.

Let’s draw the decision tree that was trained above.

First, we will need to import some additional libraries:

from sklearn.tree import export_graphviz
import IPython, graphviz, re, math

And then we can simply draw the Decision Tree as below:

def draw_tree(t, col_names, size=9, ratio=0.5, precision=3):
    """Draws a representation of a decision tree in IPython.

    Parameters:
    -----------
    t: The tree you wish to draw.
    col_names: The names of the features used to train the tree.
    """
    s = export_graphviz(t, out_file=None, feature_names=col_names, filled=True,
                        special_characters=True, rotate=True, precision=precision)
    IPython.display.display(graphviz.Source(re.sub('Tree {', f'Tree {{ size={size}; ratio={ratio}', s)))

col_names = ['X']
draw_tree(dt, col_names, precision=3)

As you can see, we take a subset of the data and decide on the best way to split that subset further.

Our initial subset was the entire data set, and we split it according to the rule X <= 1.001.

Then, for each subset, we performed additional splitting until we were able to correctly predict the target variable while respecting the constraint of max_depth=3.
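As a simpler alternative to the graphviz-based helper above (assuming scikit-learn 0.21 or later, and the fitted dt from the earlier example), sklearn.tree.plot_tree can draw the same tree with matplotlib alone:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(dt, feature_names=['X'], filled=True, precision=3)
plt.show()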

Conclusion

Decision trees are one of the most widely used machine learning models, because they work well with noisy or missing data and can easily be ensembled to form more robust predictors.

Moreover, you can directly visualise your model's learned logic, which makes it an incredibly popular model for domains where interpretability is important.

Thanks for reading, and I look forward to hearing your questions :) Stay tuned and happy machine learning.

References

https://scikit-learn.org/stable/modules/tree.html
https://en.wikipedia.org/wiki/Decision_tree_pruning
https://stackoverflow.com/questions/49428469/pruning-decision-trees

Originally published at https://gdcoder.com on May 23, 2019.
