Classification using Decision Trees

Introduction

Data Scientists use machine learning techniques to make predictions under a variety of scenarios.

Machine learning can be used to predict whether a borrower will default on his mortgage or not, or what might be the median house value in a given zip code area.

Depending upon whether the prediction is being made for a quantitative variable or a qualitative variable, a predictive model can be categorized as a regression model (e.g. predicting median house values) or a classification model (e.g. predicting loan defaults).

Decision trees happen to be one of the simplest and easiest classification models to explain and, as many argue, they closely resemble human decision making.

This blog post has been developed to help you revisit and master the fundamentals of decision tree classification models.

Our key focus will be to discuss:

- Fundamental concepts such as data partitioning, recursive binary splitting, and nodes
- Data exploration and data preparation for building classification models
- Performance metrics for decision tree models – Gini Index, Entropy, and Classification Error (see the short sketch below)
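Before diving in, it may help to keep the three impurity measures in mind. The helper functions below are written just for this post (they are not from any package) and take a vector of class proportions for a single node:

## Illustrative impurity measures for a node, given its class proportions p
## (gini_index, entropy and classification_error are names made up for this sketch)
gini_index <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
classification_error <- function(p) 1 - max(p)

## Example: a node with 80% "normal" and 20% "abnormal" patients
p <- c(0.8, 0.2)
gini_index(p)            # 0.32
entropy(p)               # ~0.72
classification_error(p)  # 0.2

All three measures are smallest (zero) for a pure node and largest for a 50/50 split, which is why they are used to score candidate splits.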

The content builds your classification model knowledge and skills in an intuitive and gradual manner.

The Scenario

You are a Data Scientist working at the Centers for Disease Control (CDC), Division for Heart Disease and Stroke Prevention.

Your division has recently completed a research study that collected health examination data on 303 patients who presented with chest pain and might have been suffering from heart disease.

The Chief Data Scientist of your division has asked you to analyze this data and build a predictive model that can accurately predict patients' heart disease status, identifying the most important predictors of heart disease.

Once your predictive model is ready, you will make a presentation to the doctors working at health facilities where the research was conducted.

The data set has 14 attributes including patients' age, gender, blood pressure, cholesterol level, and their heart disease status, indicating whether each diagnosed patient was found to have heart disease or not.

You have already learned that to predict quantitative attributes such as “blood pressure” or “cholesterol level”, regression models are used but to predict a qualitative attribute such as the “status of heart disease”, classification models are used.

Classification models can be built using different techniques such as Logistic Regression, Discriminant Analysis, K-Nearest Neighbors (KNN), Decision Trees etc.

Decision Trees are very easy to explain and can easily handle qualitative predictors without the need to create dummy variables.

Although decision trees generally do not have the same level of predictive accuracy as the K-Nearest Neighbors or Discriminant Analysis techniques, they serve as building blocks for more sophisticated classification techniques such as Random Forests, which makes mastering decision trees necessary!

We will now build decision trees to predict the status of heart disease, i.e. to predict whether the patient has a heart disease or not, and we will learn and explore the following topics along the way:

- Data preparation for decision tree models
- Classification trees using the "rpart" package
- Pruning the decision trees
- Evaluating decision tree models

## You will need the following libraries for this exercise
library(dplyr)
library(tidyverse)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)

## The following code will help you suppress warnings during package loading
options(warn = -1)

The Data

You will be working with the Heart Disease Data Set, which is available at UC Irvine's Machine Learning Repository.

You are encouraged to visit the repository and go through the data description.

As you will find, the data folder has multiple data files available.

You will use the processed.cleveland.data file.

Let's read the data file into a data frame called "cardio":

## Reading the data into the "cardio" data frame
cardio <- read.csv("processed.cleveland.data", header = FALSE, na.strings = "?")

## Let's look at the first few rows in the cardio data frame
head(cardio)

  V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
  63  1  1 145 233  1  2 150  0 2.3   3   0   6   0
  67  1  4 160 286  0  2 108  1 1.5   2   3   3   2
  67  1  4 120 229  0  2 129  1 2.6   2   2   7   1
  37  1  3 130 250  0  0 187  0 3.5   3   0   3   0
  41  0  2 130 204  0  2 172  0 1.4   1   0   3   0
  56  1  2 120 236  0  0 178  0 0.8   1   0   3   0

As you can see, this data frame doesn't have column names.

However, we can refer to the data dictionary, given below, and add the column names:

#1  Age      : Age of patient (Quantitative)
#2  Sex      : Gender of patient (Qualitative)
#3  CP       : Type of chest pain (1: Typical Angina, 2: Atypical Angina, 3: Non-anginal Pain, 4: Asymptomatic) (Qualitative)
#4  Trestbps : Resting blood pressure (in mm Hg on admission) (Quantitative)
#5  Chol     : Serum cholesterol in mg/dl (Quantitative)
#6  FBS      : Fasting blood sugar > 120 mg/dl; 1 = true, 0 = false (Qualitative)
#7  Restecg  : Resting ECG results (0 = normal; 1 and 2 = abnormal) (Qualitative)
#8  Thalach  : Maximum heart rate achieved (Quantitative)
#9  Exang    : Exercise induced angina (1 = yes; 0 = no) (Qualitative)
#10 Oldpeak  : ST depression induced by exercise relative to rest (Quantitative)
#11 Slope    : Slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping) (Qualitative)
#12 CA       : Number of major vessels (0-3) colored by fluoroscopy (Qualitative)
#13 Thal     : Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect) (Qualitative)
#14 NUM      : Angiographic disease status (0 = no heart disease; more than 0 = heart disease) (Qualitative)

The following code chunk will add column names to your data frame:

## Adding column names to the data frame
names(cardio) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                   "thalach", "exang", "oldpeak", "slope", "ca", "thal", "status")

You are going to build a decision tree model to predict values under variable #14, status, the "angiographic disease status", which labels or classifies each patient as "having heart disease" or "not having heart disease".

Intuitively, we expect some of these other 13 variables to help us predict the values under status.

In other words, we expect variables #1 to #13 to segment the patients, or create partitions in the cardio data frame, in a manner that any given partition (or segment) thus created has patients either "having heart disease" or "not having heart disease".

Data Preparation for Decision Trees

It is time to get familiar with the data.

Let's begin with data types.

## We will use the str() function
str(cardio)

'data.frame': 303 obs. of 14 variables:
 $ age     : num 63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : num 1 1 1 1 0 1 0 0 1 1 ...
 $ cp      : num 1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num 233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : num 1 0 0 0 0 0 0 0 0 1 ...
 $ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
 $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : num 0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : num 3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : num 0 3 2 0 0 0 2 0 1 0 ...
 $ thal    : num 6 3 7 3 3 3 3 3 7 7 ...
 $ status  : int 0 2 1 0 0 0 3 0 2 1 ...

As you can see, some qualitative variables in our data frame are stored as quantitative variables:

- status is declared as int, which makes it a quantitative variable, but we know the disease status must be qualitative.
- You can see that sex, cp, fbs, restecg, exang, slope, ca and thal must be qualitative too.

The next code chunk will convert and correct the data types:

## We can use lapply to convert data types across multiple columns
cardio[c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "status")] <-
  lapply(cardio[c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "status")], factor)

## You can verify the data frame
str(cardio)

'data.frame': 303 obs. of 14 variables:
 $ age     : num 63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 1 2 2 ...
 $ cp      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num 233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
 $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
 $ thal    : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
 $ status  : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...

Also note that status has 5 different values, viz. 0, 1, 2, 3, 4. While status = 0 indicates no heart disease, all other values under status indicate a heart disease.

In this exercise, you are building a decision tree model to classify each patient as "normal" (not having heart disease) or "abnormal" (having heart disease).

Therefore, you can merge status = 1, 2, 3, and 4 into a single level status = “1”.

This way you will convert status into a binary or dichotomous variable having only two values: status = "0" (normal) and status = "1" (abnormal). Let's do that!

## We will use the forcats package included in the tidyverse package
## The function to be used will be fct_collapse
cardio$status <- fct_collapse(cardio$status, "1" = c("1", "2", "3", "4"))

## Let's also change the labels under "status" from (0, 1) to (normal, abnormal)
levels(cardio$status) <- c("normal", "abnormal")

## Levels under sex can also be changed to (female, male)
## We could change level names in other categorical variables as well, but we are not doing that
levels(cardio$sex) <- c("female", "male")

So, you have corrected the data types.

What's next? How about getting a summary of all the variables in the data?

## Overall summary of all the columns
summary(cardio)

 age      : Min. 29.00, 1st Qu. 48.00, Median 56.00, Mean 54.44, 3rd Qu. 61.00, Max. 77.00
 sex      : female: 97, male: 206
 cp       : 1: 23, 2: 50, 3: 86, 4: 144
 trestbps : Min. 94.0, 1st Qu. 120.0, Median 130.0, Mean 131.7, 3rd Qu. 140.0, Max. 200.0
 chol     : Min. 126.0, 1st Qu. 211.0, Median 241.0, Mean 246.7, 3rd Qu. 275.0, Max. 564.0
 fbs      : 0: 258, 1: 45
 restecg  : 0: 151, 1: 4, 2: 148
 thalach  : Min. 71.0, 1st Qu. 133.5, Median 153.0, Mean 149.6, 3rd Qu. 166.0, Max. 202.0
 exang    : 0: 204, 1: 99
 oldpeak  : Min. 0.00, 1st Qu. 0.00, Median 0.80, Mean 1.04, 3rd Qu. 1.60, Max. 6.20
 slope    : 1: 142, 2: 140, 3: 21
 ca       : 0: 176, 1: 65, 2: 38, 3: 20, NA's: 4
 thal     : 3: 166, 6: 18, 7: 117, NA's: 2
 status   : normal: 164, abnormal: 139

Did you notice the missing values (NA's) under the ca and thal columns? With the following code, you can count the missing values across all the columns in your data frame.

# Counting the missing values in the data frame
sum(is.na(cardio))

6

Only 6 missing values across 303 rows, which is approximately 2%.

That seems to be a very low proportion of missing values.

What do you want to do with these missing values before you start building your decision tree model?

- Option 1: discard the missing values before training?
- Option 2: rely on the machine learning algorithm to deal with missing values during model training?
- Option 3: impute missing values before training?

For most learning methods, Option 3, the imputation approach, is necessary.

The simplest approach is to impute the missing values by mean or median of the non-missing values for the given feature.
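As a minimal sketch of what Option 3 could look like here (we will not actually use it, since we drop the rows instead): the two columns with NA's, ca and thal, are factors after the conversion above, so instead of a mean or median you would impute the most frequent level. The helper impute_mode and the data frame cardio_imputed are names made up for this sketch:

## Minimal sketch of Option 3: replace NA's with the most common level
## (impute_mode and cardio_imputed are illustrative names, not from any package)
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
cardio_imputed <- cardio
cardio_imputed$ca   <- impute_mode(cardio_imputed$ca)
cardio_imputed$thal <- impute_mode(cardio_imputed$thal)
sum(is.na(cardio_imputed))  # should now be 0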

The choice of Option 2 depends on the learning algorithm. Learning algorithms such as CART (implemented in rpart) simply ignore missing values when determining the quality of a split. To determine whether a case with a missing value for the best split is to be sent left or right, the algorithm uses surrogate splits.

You may want to read more on this here.

However, if the relative amount of missing data is small, you can go for Option 1 and discard the missing values, as long as doing so does not create or further aggravate the class imbalance problem, which is briefly discussed in a following section.

As for your data set, you are safe to delete missing value cases.

The following code chunk does that for you.

## Removing missing values
cardio <- na.omit(cardio)

Data Exploration

Status is the variable that you want to predict with your model.

As we have discussed earlier, other variables in the cardio dataset should help you predict status.

For example, amongst patients with a heart disease, you might expect the average value of Cholesterol levels (chol), to be higher than amongst those who are normal.

Likewise, amongst patients with high blood sugar (fbs = 1), the proportion of patients with heart disease would be expected to be higher than it is amongst patients who are normal.

You can do some data visualization and exploration.

You may want to start with the distribution of status. The following code chunk will provide you that:

## Plotting a histogram for status
cardio %>%
  ggplot(aes(x = status)) +
  geom_histogram(stat = "count", fill = "steelblue") +
  theme_bw()

From this histogram, you can observe that there is almost an equal split between patients having status as normal and abnormal.

This may not always be the case.

There might be datasets in which one of the classes in the predicted variable has a very low proportion.

Such datasets are said to have class imbalance problem where one of the classes in the predicted variable is rare within the dataset.

A credit card fraud detection model or a mortgage loan default model are examples of classification models built on datasets having a class imbalance problem.
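A quick way to check how balanced your own target variable is, before worrying about specialised techniques, is to look at the class proportions directly. A short sketch:

## Checking the class balance of the target variable
table(cardio$status)
prop.table(table(cardio$status))  # roughly an equal split here (~54% normal vs ~46% abnormal)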

What other scenarios come to your mind? You are encouraged to read this article: ROSE: A Package for Binary Imbalanced Learning.

You should now explore the distribution of the quantitative variables.

You can make density plots with frequency counts on Y-axis and split the plot by the two levels in the status variable.

The following code will produce the plots arranged in a grid of 2 rows:

## Frequency plots for quantitative variables, split by status
cardio %>%
  gather(-sex, -cp, -fbs, -restecg, -exang, -slope, -ca, -thal, -status,
         key = "var", value = "value") %>%
  ggplot(aes(x = value, y = ..count.., colour = status)) +
  scale_color_manual(values = c("#008000", "#FF0000")) +
  geom_density() +
  facet_wrap(~var, scales = "free", nrow = 2) +
  theme_bw()

What are your observations from the quantitative plots? Some of your observations might be:

In all the plots, as we move along the X-axis, the abnormal curve, mostly but not always, lies below the normal curve.

You should expect this as the total number of patients under abnormal is smaller.

However, for some values on the X-axis (could be smaller values of X or larger, depending upon the predictor), the abnormal curve lies above.

For example, look at the age plot.

Up to x = 55 years, the majority of patients fall under the normal curve. Once x > 55 years, the majority shifts to the abnormal curve and remains so until x = 68 years.

Intuitively, age could be a good predictor of status and you may want to partition the data at x = 55 years and then again at x = 68 years.

When you build your decision tree model, you may expect internal nodes with x > 55 years and x > 68 years.
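If you want to check the age hypothesis numerically rather than from the density plot alone, a quick cross-tab is enough; a sketch, using the 55-year cut-off:

## Proportion of normal vs abnormal patients on either side of age = 55
prop.table(table(cardio$age > 55, cardio$status), margin = 1)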

Next, observe the plot for chol.

Except for a narrow range (x = 275 mg/dl to x = 300 mg/dl), the normal curve always lies above the abnormal curve.

You may want to form a hypothesis that Cholesterol is not a good predictor of status.

In other words, you may not expect chol to be amongst the earliest internal nodes in your decision tree model.

Likewise, you can form hypotheses for the other quantitative variables as well.

Of course, your decision tree model will help you validate your hypothesis.

Now you may want to turn your attention to qualitative variables.

## Frequency plots for qualitative variables, split by status
cardio %>%
  gather(-age, -trestbps, -chol, -thalach, -oldpeak, -status,
         key = "var", value = "value") %>%
  ggplot(aes(x = value, color = status)) +
  scale_color_manual(values = c("#008000", "#FF0000")) +
  geom_histogram(stat = "count", fill = "white") +
  facet_wrap(~var, scales = "free", nrow = 3) +
  theme_bw()

What are your observations from the qualitative plots? How do you want to partition the data along the qualitative variables?

Observe the cp, or chest pain, plot.

Presence of asymptomatic chest pain, indicated by cp = 4, could provide a partition in the data and could be among the earliest nodes in your decision tree.

Likewise, observe the sex plot.

Clearly, the proportion of abnormal is much lower (approximately 25%) among the females compared to the proportion among males (approximately 50%).
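You can verify these approximate proportions with a quick summary; a sketch using dplyr, which is already loaded:

## Proportion of abnormal patients within each sex
cardio %>%
  group_by(sex) %>%
  summarise(prop_abnormal = mean(status == "abnormal"))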

Intuitively, sex might also be a good predictor and you may want to partition the patients' data along sex.

When you build your decision tree model, you may expect an internal node that splits on sex.

At this point, you may want to go back to both sets of plots and list the partitions (variables and, more importantly, variable values) that you expect to find in your decision tree model.

Of course, all our hypotheses will get validated once we build our decision tree model.

Partitioning Data: Training and Test Sets

Before you start building your decision tree, split the cardio data into a training set and a test set:

- cardio.train: 70% of the dataset
- cardio.test: 30% of the dataset

The following code chunk will do that:

## Now you can randomly split your data into a 70% training set and a 30% test set
## You should set the seed to ensure that you get the same training vs. test split every time you run the code
set.seed(1)

## Randomly extract the row numbers of the cardio dataset that will be included in the training set
train.index <- sample(1:nrow(cardio), round(0.70 * nrow(cardio), 0))

## Subset the cardio data set to include only the rows in train.index to get cardio.train
cardio.train <- cardio[train.index, ]

## Subset the cardio data set to include only the rows NOT in train.index to get cardio.test
## Did you note the negative sign?
cardio.test <- cardio[-train.index, ]

Classification Trees Using the "rpart" Package

You will now use the rpart package to build your decision tree model.
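Before fitting anything, it is worth confirming that the random split kept the two classes reasonably balanced in both subsets; a quick sketch:

## Class proportions in the training and test sets
prop.table(table(cardio.train$status))
prop.table(table(cardio.test$status))

## Row counts should add back up to the full data set
nrow(cardio.train) + nrow(cardio.test) == nrow(cardio)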

The decision tree that you build can be plotted using the packages rpart.plot or rattle, the latter of which provides better-looking plots.

You will use the function rpart() to build your decision tree model. The function has the following key arguments:

- formula: rpart(formula, ...) The formula where you declare which predictors you are using in your decision tree. You can specify status ~ . to indicate that you want to use all the predictors in your decision tree.
- method: rpart(method = <>, ...) The same function can be used to build a classification tree as well as a regression tree. You can use "class" to specify that you are using rpart() to build a classification tree. If you were building a regression tree, you would specify "anova" instead.
- cp: rpart(cp = <>, ...) The main role of the complexity parameter (cp) is to control the size of the decision tree. Any split that does not reduce the tree's overall complexity by a factor of cp is not attempted. The default value is 0.01. A value of cp = 1 will result in a tree with no splits. Setting cp to a negative value ensures a fully grown tree.
- minsplit: rpart(minsplit = <>, ...) The minimum number of observations that must exist in a node in order for a split to be attempted. The default value is 20.
- minbucket: rpart(minbucket = <>, ...) The minimum number of observations in any terminal node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, which is the default.

A short sketch after this list shows how these arguments can be passed together.
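These tuning arguments can be given to rpart() directly or bundled through rpart.control(). A minimal sketch (fit.controlled is a name made up here, and the parameter values are arbitrary, just to show the syntax):

## Passing the tuning parameters through rpart.control()
## (values below are arbitrary, for illustration only)
fit.controlled <- rpart(status ~ .,
                        data = cardio.train,
                        method = "class",
                        control = rpart.control(cp = 0.01,
                                                minsplit = 20,
                                                minbucket = 7))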

You are encouraged to read the package documentation: rpart documentation.

You can build a decision tree using all the predictors and with cp = 0.05. The following code chunk will build your decision tree model:

## Using all the predictors and setting cp = 0.05
cardio.train.fit <- rpart(status ~ ., data = cardio.train, method = "class", cp = 0.05)

It is time to plot your decision tree.

You can use the function rpart.plot() for plotting your tree. However, the function fancyRpartPlot() in the rattle package produces fancier plots:

## Using fancyRpartPlot() from the "rattle" package
fancyRpartPlot(cardio.train.fit, palettes = c("Greens", "Reds"), sub = "")

Interpreting the Decision Tree Plot

What are your observations from your decision tree plot?

Each square box is a node of one or the other type (discussed below):

Root Node (cp = 1, 2, 3): The root node represents the entire population, or 100% of the sample.

Decision Nodes (thal = 3 and ca = 0): These are the internal nodes that split further, either into other internal nodes or into terminal nodes. Counting the root, there are 3 decision nodes here.

Terminal Nodes (Leaves): The nodes that do not split further are called terminal nodes, or leaves. Your decision tree has 4 terminal nodes.

The decision tree plot gives the following information:

Predictors Used in Model: Only the thal, cp, and ca variables are included in this decision tree.

Predicted Probabilities: Predicted probability of a patient being normal or abnormal.

Note that the two probabilities add to 100%, at each node.

Node Purities: Each node has two proportions written left and right. The leftmost leaf has 0.82 and 0.18. The number on the left, 0.82, tells you what proportion of the node actually belongs to the predicted class. You can see that this leaf has 82% purity.

Sample Proportion: Each node has a proportion of the sample.

The proportion is 100% for the root node.

The percentages under the split-nodes add up to give the percentage in their parent node.

Predicted class: Each node shows the predicted class as  normal or abnormal.

It is the most commonly occurring class in that node, but the node might still include observations belonging to the other class as well.

This forms the concept of node impurity.
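If you prefer the same node information in text form (node counts, predicted class, and class probabilities), you can print the fitted object, and predicted probabilities for new patients come from predict(). A short sketch:

## Text view of the fitted tree: node counts, predicted class, class probabilities
print(cardio.train.fit)

## Predicted class probabilities for the first few test-set patients
predict(cardio.train.fit, newdata = head(cardio.test), type = "prob")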

Fully Grown Decision Tree

Is this the fully-grown decision tree? No!

Recall that you have grown the decision tree with cp = 0.05 (the default is 0.01), which ensured that your decision tree does not include any split that fails to decrease the overall lack of fit by a factor of 5%.

However, if you change this parameter, you might get a different decision tree.

Run the following code chunk to get the plot of a fully grown decision tree, with cp = 0:

## Using all the predictors, setting cp = 0 and all other arguments to default
cardioFull <- rpart(status ~ ., data = cardio.train, method = "class", cp = 0)

## Using fancyRpartPlot() from the "rattle" package
fancyRpartPlot(cardioFull, palettes = c("Greens", "Reds"), sub = "")

The fully grown tree adds more predictors, such as oldpeak, to the tree that you built earlier.

Now you have seen that changing the cp parameter gives a decision tree of a different size – more nodes and/or more leaves.
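One way to reason about which cp value to keep (this is where pruning comes in) is to look at the cross-validated error that rpart reports for each candidate cp of the fully grown tree. A sketch; best.cp and cardio.pruned are names made up for this example:

## Complexity parameter table with cross-validated error for each sub-tree
printcp(cardioFull)
plotcp(cardioFull)

## Pruning back to a chosen cp value, e.g. the one minimising the cross-validated error
best.cp <- cardioFull$cptable[which.min(cardioFull$cptable[, "xerror"]), "CP"]
cardio.pruned <- prune(cardioFull, cp = best.cp)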

At this stage, you might want to ask the following question: which of the two decision trees should you go ahead with and present to your division's Chief Data Scientist? The one developed with cp = 0.05, or the one with cp = 0?
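One reasonable way to answer that question is to compare the two trees on the held-out cardio.test set you created earlier; a sketch of that comparison (pred.small and pred.full are names made up for this example):

## Comparing the two trees on the test set
pred.small <- predict(cardio.train.fit, newdata = cardio.test, type = "class")
pred.full  <- predict(cardioFull, newdata = cardio.test, type = "class")

## Confusion matrices
table(Predicted = pred.small, Actual = cardio.test$status)
table(Predicted = pred.full, Actual = cardio.test$status)

## Simple accuracy comparison
mean(pred.small == cardio.test$status)
mean(pred.full == cardio.test$status)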
