Using mlr for Machine Learning in R: A Step By Step Approach for Decision Trees

Hyperparameter Tuning for optimizing performance.

Asel Mendis · Nov 9

I personally like to use mlr to conduct my machine learning tasks, but you could just as well use any other library to your liking.

First let's load the relevant libraries:

mlr for the machine learning algorithms
FSelector for feature selection

A look at the dataset I worked on in my previous post shows the variables we will be working with; the target is the Outcome factor:

$ Outcome <fct> Positive, Negative, Positive, Ne…

Train and Test Set

I am going to work with an 80/20 train/test split.

set.seed(1000)
train_index <- sample(1:nrow(Diabetes), 0.8 * nrow(Diabetes))
test_index <- setdiff(1:nrow(Diabetes), train_index)
train <- Diabetes[train_index, ]
test <- Diabetes[test_index, ]

list(
  train = summary(train),
  test = summary(test)
)

The training set shows our target variable having 212 positive outcomes and 402 negative outcomes. The test set shows that we have 56 positive outcomes and 98 negative outcomes.

There is an obvious class imbalance in our target variable, and because it is skewed towards 'Negative' (no diabetes) we will find it harder to build a predictive model for a 'Positive' outcome. You can address this by re-balancing the classes, which involves re-sampling. I do not know whether that would fix any underlying issues, but threshold adjustment also allows you to alter a prediction to give a completely different outcome.

Decision Tree

(dt_task <- makeClassifTask(data = train, target = "Outcome"))

Supervised task: train
Type: classif
Target: Outcome
Observations: 614
Features:
   numerics     factors     ordered functionals
          2           3           0           0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 2
Positive Negative
     212      402
Positive class: Positive

First we have to make a classification task with our training set. This is where we define which type of machine learning problem we are trying to solve and specify the target variable. As we can see, the Positive level of Outcome has defaulted to the positive class of the task, which is what we want here: we are trying to predict the people that have diabetes (namely, the Positive level of the Outcome variable).

Learner

(dt_prob <- makeLearner('classif.rpart', predict.type = "prob"))

Learner classif.rpart from package rpart
Type: classif
Name: Decision Tree; Short name: rpart
Class: classif.rpart
Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights,featimp
Predict-Type: prob
Hyperparameters: xval=0

After creating a classification task we need to make a learner that will later take our task to learn the data. Before training it, let's look at feature importance: generateFeatureImportanceData estimates the change in a performance measure when each feature is permuted, and as performance measures I have chosen the True Positive Rate and the Area Under the Curve.

generateFeatureImportanceData(task = dt_task, learner = dt_prob, measure = tpr, interaction = FALSE)

FeatureImportance:
Task: train
Interaction: FALSE
Learner: classif.rpart
Measure: tpr
Contrast: function (x, y) x - y
Aggregation: function (x, ...) UseMethod("mean")
Replace: TRUE
Number of Monte-Carlo iterations: 50
Local: FALSE

                                 tpr
Pregnancies                        0
Glucose                   -0.1869811
BMI                       -0.1443396
DiabetesPedigreeFunction -0.06339623
Age                      -0.06896226

generateFeatureImportanceData(task = dt_task, learner = dt_prob, measure = auc, interaction = FALSE)

FeatureImportance:
Task: train
Interaction: FALSE
Learner: classif.rpart
Measure: auc
Contrast: function (x, y) x - y
Aggregation: function (x, ...) UseMethod("mean")
Replace: TRUE
Number of Monte-Carlo iterations: 50
Local: FALSE

                                 auc
Pregnancies                        0
Glucose                   -0.1336535
BMI                      -0.07317023
DiabetesPedigreeFunction -0.01907362
Age                      -0.08251478
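Alongside the permutation importance above, the analysis below also refers to information gain and gain ratio; the code for that step is not shown in this excerpt. A minimal sketch of how such filter scores could be computed through mlr's filter interface follows; treat it as an assumption rather than the author's exact call, since the filter names differ between mlr/FSelector versions ("information.gain" vs "FSelector_information.gain").

# Hedged sketch: filter-based importance scores via FSelector.
# Filter names and the exact output format depend on your mlr version.
library(mlr)
library(FSelector)

fv <- generateFilterValuesData(
  dt_task,
  method = c("information.gain", "gain.ratio")
)
fv$data               # one row per feature with its filter scores
plotFilterValues(fv)  # optional bar chart of the scores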
As we can see from the above output:

The information gain and gain ratio show a score of zero or a low score for Pregnancies.
generateFeatureImportanceData shows a score of zero for Pregnancies whether we use the True Positive Rate or the AUC as the performance measure.

Looking at all the evidence, Pregnancies will be the only variable I discard. There is ongoing discussion about what the right amount of data and features is (the Curse of Dimensionality).

Below I have taken Pregnancies out of our train and test sets and made a new classification task with our new training set.

set.seed(1000)
train <- select(train, -Pregnancies)
test <- select(test, -Pregnancies)

list(
  train = summary(train),
  test = summary(test)
)

Another problem is that in the Glucose category, 'Hypoglycemia' has only 5 representations in the whole dataset. Therefore we need to remove Hypoglycemia from both datasets:

train <- filter(train, Glucose != 'Hypoglycemia') %>% droplevels()
test <- filter(test, Glucose != 'Hypoglycemia') %>% droplevels()

list(
  train = summary(train),
  test = summary(test)
)

As we now have new datasets, we need to make a new classification task based on the new training set.

(dt_task <- makeClassifTask(data = train, target = "Outcome"))

Supervised task: train
Type: classif
Target: Outcome
Observations: 609
Features:
   numerics     factors     ordered functionals
          2           2           0           0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 2
Positive Negative
     210      399
Positive class: Positive

Hyper Parameter Tuning

Any machine learning algorithm will require us to tune its hyperparameters at our own discretion. Tuning hyperparameters is the process of selecting values for the algorithm's parameters with the goal of obtaining your desired level of performance.

Tuning a machine learning algorithm in mlr involves the following procedures (a sketch of what they can look like is shown after the list):

1. Define a search space.
2. Define the optimization algorithm (aka tuning method).
3. Define an evaluation method (i.e. a re-sampling strategy and a performance measure).
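The extracted post does not show the search space, tuning control, or resampling code that produced the dt_tuneparam object used below, so the following is only a minimal sketch of those three steps for classif.rpart. The parameter ranges, grid search, and 3-fold stratified cross-validation are placeholder assumptions rather than the author's settings; the result is named dt_tuneparam so that the later code reads through unchanged.

# Hedged sketch of the three tuning steps; ranges and settings are assumptions.

# 1. Search space over a few rpart hyperparameters
dt_param <- makeParamSet(
  makeIntegerParam("minsplit",  lower = 5L, upper = 20L),
  makeIntegerParam("minbucket", lower = 1L, upper = 10L),
  makeNumericParam("cp",        lower = 0.001, upper = 0.05),
  makeIntegerParam("maxdepth",  lower = 3L, upper = 10L)
)

# 2. Optimization algorithm (tuning method): a plain grid search
dt_control <- makeTuneControlGrid()

# 3. Evaluation method: stratified cross-validation, tracking the mean and
#    standard deviation of the True Positive Rate along with other measures
dt_resample <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)

dt_tuneparam <- tuneParams(
  learner    = dt_prob,
  task       = dt_task,
  resampling = dt_resample,
  measures   = list(tpr, setAggregation(tpr, test.sd), auc, fnr, mmce, tnr),
  par.set    = dt_param,
  control    = dt_control,
  show.info  = TRUE
)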
Search Space

Defining a search space means specifying the range of values to try for each hyperparameter. I also want to get the standard deviation of the True Positive Rate across the test folds during cross-validation; if the performance results begin to diverge too much, the data may be inadequate.

For the optimal hyperparameters, the standard deviation of the True Positive Rate on the test folds is 0.0704698, which is relatively low and gives us an idea of the True Positive Rate we will obtain later when predicting.

A more computationally expensive search produced this result:

[Tune] Result: minsplit=17; minbucket=7; cp=0.0433; maxcompete=4; usesurrogate=0; maxdepth=7 :
tpr.test.mean=0.6904762, auc.test.mean=0.7277720, f1.test.mean=0.6156823, acc.test.mean=0.7283265,
mmce.test.mean=0.2716735, timepredict.test.mean=0.0000000, tnr.test.mean=0.7460928

Although the TPR is higher here, I am going to stick with my previous hyperparameters because that search is less computationally expensive.

Optimal HyperParameters

list(
  `Optimal HyperParameters` = dt_tuneparam$x,
  `Optimal Metrics` = dt_tuneparam$y
)

$`Optimal HyperParameters`
$`Optimal HyperParameters`$minsplit
[1] 9

$`Optimal HyperParameters`$minbucket
[1] 2

$`Optimal HyperParameters`$cp
[1] 0.01444444

$`Optimal HyperParameters`$maxcompete
[1] 6

$`Optimal HyperParameters`$usesurrogate
[1] 0

$`Optimal HyperParameters`$maxdepth
[1] 10

$`Optimal Metrics`
 tpr.test.mean  auc.test.mean  fnr.test.mean mmce.test.mean
    0.60952381     0.78073756     0.39047619     0.27257800
 tnr.test.mean    tpr.test.sd
    0.78947368     0.07046976

Using dt_tuneparam$x we can extract the optimal values, and dt_tuneparam$y gives us the corresponding performance measures. setHyperPars then tunes the learner with those optimal values.

dtree <- setHyperPars(dt_prob, par.vals = dt_tuneparam$x)

Model Training

set.seed(1000)
dtree_train <- train(learner = dtree, task = dt_task)
getLearnerModel(dtree_train)

n= 609

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 609 210 Negative (0.34482759 0.65517241)
   2) Glucose=Hyperglycemia 149 46 Positive (0.69127517 0.30872483)
     4) BMI=Obese 117 28 Positive (0.76068376 0.23931624) *
     5) BMI=Normal,Overweight 32 14 Negative (0.43750000 0.56250000) *
   3) Glucose=Normal 460 107 Negative (0.23260870 0.76739130)
     6) Age>=28.5 215 78 Negative (0.36279070 0.63720930)
      12) BMI=Underweight,Overweight,Obese 184 77 Negative (0.41847826 0.58152174)
        24) DiabetesPedigreeFunction>=0.5275 61 23 Positive (0.62295082 0.37704918) *
        25) DiabetesPedigreeFunction< 0.5275 123 39 Negative (0.31707317 0.68292683) *
      13) BMI=Normal 31 1 Negative (0.03225806 0.96774194) *
   7) Age< 28.5 245 29 Negative (0.11836735 0.88163265) *

rpart.plot(dtree_train$learner.model, roundint = FALSE, varlen = 3,
           type = 3, clip.right.labs = FALSE, yesno = 2)

[Figure: Decision Tree Classification of the Pima Indian Diabetes dataset]

rpart.rules(dtree_train$learner.model, roundint = FALSE)

Outcome
   0.24 when Glucose is Hyperglycemia & BMI is Obese
   0.38 when Glucose is Normal & BMI is Underweight or Overweight or Obese & Age >= 29 & DiabetesPedigreeFunction >= 0.53
   0.56 when Glucose is Hyperglycemia & BMI is Normal or Overweight
   0.68 when Glucose is Normal & BMI is Underweight or Overweight or Obese & Age >= 29 & DiabetesPedigreeFunction < 0.53
   0.88 when Glucose is Normal & Age < 29
   0.97 when Glucose is Normal & BMI is Normal & Age >= 29

After training the decision tree I was able to plot it with the rpart.plot function, and I can easily read off the rules of the tree with rpart.rules. Since mlr is a wrapper for machine learning algorithms, I can customize it to my liking, and this is just one example.
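Because classif.rpart carries the featimp property, one optional check (not part of the excerpt above) is to ask the fitted model which variables the tree actually relied on. A small sketch, assuming the dtree_train object trained above:

# Hedged sketch: variable importance of the fitted rpart tree (the exact
# shape of $res differs between mlr versions).
imp <- getFeatureImportance(dtree_train)
imp$res  # per-feature importance scores reported by rpart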
Model Prediction (Testing)

We now pass the trained learner to make predictions on our test data.

set.seed(1000)
(dtree_predict <- predict(dtree_train, newdata = test))

Prediction: 154 observations
predict.type: prob
threshold: Positive=0.50,Negative=0.50
time: 0.00
     truth prob.Positive prob.Negative response
1 Negative     0.3170732     0.6829268 Negative
2 Positive     0.6229508     0.3770492 Positive
3 Negative     0.4375000     0.5625000 Negative
4 Negative     0.3170732     0.6829268 Negative
5 Positive     0.7606838     0.2393162 Positive
6 Negative     0.1183673     0.8816327 Negative
…
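The article continues past this excerpt. As a hedged sketch of the usual next steps rather than the author's exact code, one could score these predictions and, given the class imbalance discussed earlier, experiment with moving the Positive threshold via setThreshold (the 0.40/0.60 split below is purely illustrative):

# Hedged sketch: evaluate the test-set predictions and try an alternative
# probability threshold for the Positive class (threshold values are illustrative).
performance(dtree_predict, measures = list(tpr, auc, mmce, tnr))
calculateConfusionMatrix(dtree_predict, relative = TRUE)

# Shift the decision threshold towards the minority (Positive) class
dtree_predict_thr <- setThreshold(dtree_predict,
                                  threshold = c(Positive = 0.40, Negative = 0.60))
calculateConfusionMatrix(dtree_predict_thr, relative = TRUE)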
