Random Forest

They can also be more interpretable than other complex models such as neural networks. The content is organized as follows:

- What is a random forest
- Interpreting a random forest
- Bias towards features with more categories
- Handling redundant features
- Outlier detection
- Clustering
- Why random forests are more accurate than single decision trees
- Software packages

What is a random forest

A random forest consists of multiple random decision trees. The figure below illustrates the flow of applying a random forest with three trees to a testing data instance.

[Figure: the flow (highlighted in green) of predicting a testing instance with a random forest with 3 trees.]

Interpreting a random forest

A feature's importance score measures the contribution from that feature. The accuracy impact plot below shows that X5's accuracy impact is quite small compared to the truly informative features, indicating that the feature is confusing the model and should be removed before fitting a classifier.

Handling redundant features

When features are similar to each other, the importance scores of these features can be misleading. The right figure below shows the importance scores from RRF.

Outlier detection with random forests

Clustering with random forests

Clustering with random forests can avoid the need for feature transformation (e.g., of categorical features). A synthetic data set is generated and combined with the original data, with original versus synthetic as the class labels; the combined data set is shown in the right figure below. A random forest is then built for this classification problem. From the built random forest, a similarity score between each pair of data instances is extracted.

Why random forests are more accurate than single decision trees

While a single decision tree like CART is often pruned, a random forest tree is fully grown and unpruned, and so, naturally, the feature space is split into more and smaller regions. Trees are also diverse: each random forest tree is learned on a random sample of the data, and at each node a random subset of features is considered for splitting. In the figure below, the boundary is smoother but makes obvious mistakes (overfitting).

So how can random forests build unpruned trees without overfitting? For the two-class (blue and red) problem below, both splits x1 = 3 and x2 = 3 can fully separate the two classes. The two splits, however, result in very different decision boundaries. Decision trees often use the first variable to split, and so the ordering of the variables in the training data determines the decision boundary.

Now consider random forests. For each random sample used for training a tree, the probability that the red point is missing from the sample is (1 - 1/n)^n, which approaches 1/e ≈ 0.37 for large n. So roughly 1 out of 3 trees is built with all blue data and always predicts class blue.
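The bootstrap argument above can be checked numerically. The chance that one particular point is absent from a bootstrap sample of size n is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows:

```python
# Numerical check of the bootstrap-omission probability discussed above.
import math

for n in (10, 100, 1000):
    p_missing = (1 - 1 / n) ** n  # chance a given point is left out of the sample
    print(n, round(p_missing, 4))

print(round(1 / math.e, 4))  # the limit value, roughly "1 out of 3"
```

Even for small n the probability is already close to the limit, which is why roughly a third of the trees never see the red point.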
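The feature-importance scores discussed earlier can be obtained from most random forest implementations. Below is a minimal sketch using scikit-learn (an assumption; the article does not tie the discussion to a particular library, and its redundancy example uses the RRF R package). The data set is synthetic: only the first two of five features drive the label, mimicking the article's uninformative feature X5.

```python
# Sketch: fit a random forest and inspect feature importance scores.
# The data and feature names X1..X5 are illustrative, not the article's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))            # columns X1..X5
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only X1 and X2 are informative

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in zip(["X1", "X2", "X3", "X4", "X5"], forest.feature_importances_):
    print(name, round(score, 3))
```

The uninformative columns receive much smaller scores than X1 and X2, matching the accuracy-impact reasoning above.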
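The clustering procedure described above can be sketched as follows, assuming the standard synthetic-data construction (each feature sampled independently from its marginal, so the joint structure is destroyed) and pairwise similarity measured as the fraction of trees in which two instances land in the same leaf; the concrete helpers below are illustrative, not the article's code.

```python
# Sketch: random-forest clustering via pairwise similarity (proximity) scores.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# "Real" data: two well-separated clusters, no labels needed.
real = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Synthetic data: sample each feature independently from its marginal.
synthetic = np.column_stack(
    [rng.choice(real[:, j], size=len(real)) for j in range(real.shape[1])]
)

# Build a random forest for the real-vs-synthetic classification problem.
X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Similarity of two instances = fraction of trees placing them in the same leaf.
leaves = forest.apply(real)  # shape (n_real, n_trees): leaf index per tree
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

The resulting matrix `prox` can be fed to any similarity-based clustering method; instances from the same cluster share leaves far more often than instances from different clusters.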
