An Introduction to Random Forest

Random forests are among the most popular and most accurate machine learning models. They can also be more interpretable than other complex models such as neural networks.

The content is organized as follows:

- What is a random forest
- Interpreting a random forest
- Bias towards features with more categories
- Handling redundant features
- Outlier detection
- Clustering
- Why random forests are more accurate than single decision trees
- Software packages

What is a random forest

A random forest consists of multiple random decision trees. Two types of randomness are built into the trees. First, each tree is built on a random sample (drawn with replacement) from the original data. Second, at each tree node, a random subset of features is considered when choosing the best split.

The figure below illustrates the flow of applying a random forest with three trees to a testing data instance.

Figure: the flow (highlighted in green) of predicting a testing instance with a random forest with 3 trees.

Interpreting a random forest

Feature importance

A feature's importance score measures the contribution from the feature: the more a feature's splits reduce impurity across the trees, the higher its score.

Partial dependence

A partial dependence plot shows how a single feature affects the model's predictions when the other features are averaged out.

Figure (right panel): when X1 alone is not correlated to the class, partial dependence can be misleading.

inTrees

Neither importance scores nor partial dependence plots tell how multiple features interact with the class. The inTrees framework addresses this by extracting, measuring, and selecting rules from the trees of a forest.

Bias towards features with more categories

Consider a dataset where X1, X2, and X3 are truly informative, X4 and X5 are irrelevant, and X5 has many categories.

Figure: importance scores; X1, X2, and X3 are truly informative, X4 and X5 are irrelevant, and X5 has many categories.

Even though X5 is irrelevant to the class, its importance score is larger than those of the truly informative features X2 and X3, indicating an incorrect bias towards features with more categories.

Handling redundant features

Importance scores can also be distorted when features are redundant, since the contribution of the underlying signal is diluted across the redundant features. One solution is to perform feature selection, for example with a regularized random forest (RRF), which penalizes splitting on a new feature when an already selected feature carries similar information. The right figure below shows the importance scores from RRF.

Figure: Left: feature importance from a random forest; Right: feature importance from a regularized random forest.

Outlier detection with random forests

Outlier detection with random forests can avoid the need for feature transformation (e.g., encoding categorical features). The idea is to turn the task into a classification problem: class 1 is the original data; class 2 has the same size as the original data but with X1 and X2 randomly permuted. A random forest is built on this dataset. From the forest, a similarity score between instances can be extracted (the fraction of trees in which two instances end up in the same leaf), and instances with low similarity to the rest of the data are flagged as outliers.

Figure (right panel): feature importance scores in outlier detection.

Clustering with random forests

Similar to outlier detection, clustering with random forests saves effort in feature preprocessing. The procedure is also similar: an artificial class 2 is created by randomly permuting the features of the original data (class 1), and a random forest is then built for the classification problem. From the built random forest, a similarity score between each pair of data instances is extracted. The resulting similarity matrix can then be fed into a standard clustering method such as hierarchical clustering.

Why random forests are more accurate than single decision trees

While a single decision tree like CART is often pruned, a random forest tree is fully grown and unpruned, so, naturally, the feature space is split into more and smaller regions.

Trees are diverse: each tree is trained on a different random sample and considers different feature subsets, so individual trees make different errors and the majority vote averages those errors out. As an example, consider a dataset of blue points containing a single red point. For each random sample of size n used for training a tree, the probability that the red point is missing from the sample is (1 - 1/n)^n ≈ e^(-1) ≈ 37%. So roughly 1 out of 3 trees is built with all blue data and always predicts class blue.

Software packages

Random forests are implemented in many packages, for example the randomForest package in R and RandomForestClassifier in Python's scikit-learn. The regularized random forest and the rule extraction framework discussed above are available as the RRF and inTrees packages in R.
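To make the ideas above concrete, here are a few short Python sketches using scikit-learn. These are minimal illustrations under stated assumptions (synthetic data, arbitrary parameter choices), not the article's original experiments. First, training a forest and reading the feature importance scores discussed earlier:

```python
# Minimal sketch: train a random forest and read impurity-based feature
# importance scores. The synthetic dataset and feature names are
# illustrative, not the article's original experiment.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=5,
                           n_informative=3, n_redundant=0,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importance: how much each feature reduces impurity,
# averaged over all splits in all trees of the forest.
for name, score in zip(["X1", "X2", "X3", "X4", "X5"],
                       forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```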
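inTrees itself is an R package; as a rough Python-side stand-in, one can at least print the raw decision rules of a single tree in a forest. This is plain rule printing only, without inTrees' rule measuring, pruning, and selection:

```python
# Print the rules of one tree from a forest, as a rough stand-in for
# rule extraction (inTrees proper is an R package and also measures,
# prunes, and selects rules). Data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, n_redundant=0,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Rules of the first tree, truncated to depth 3 for readability.
print(export_text(forest.estimators_[0],
                  feature_names=["X1", "X2", "X3", "X4", "X5"],
                  max_depth=3))
```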
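The synthetic-class trick behind the clustering and outlier detection sections can be sketched as follows. The permutation step and the proximity definition (fraction of trees in which two instances share a leaf) follow the procedure described above; the data, parameters, and the Breiman-style outlier score are illustrative assumptions:

```python
# Sketch of clustering and outlier scoring with a random forest:
# build an artificial two-class problem (original data vs. data with
# each feature independently permuted), fit a forest, derive pairwise
# similarities from shared leaves, then cluster on (1 - similarity).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # stand-in for the real data

# Class 2: same size as the original data, each column permuted
# independently, which destroys the joint structure of the features.
X_perm = np.column_stack([rng.permutation(X[:, j])
                          for j in range(X.shape[1])])
X_all = np.vstack([X, X_perm])
y_all = np.r_[np.zeros(len(X)), np.ones(len(X_perm))]

forest = RandomForestClassifier(n_estimators=200,
                                random_state=0).fit(X_all, y_all)

# Leaf index of every original instance in every tree: (n, n_trees).
leaves = forest.apply(X)

# Proximity: fraction of trees where two instances share a leaf.
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Hierarchical clustering on the induced distance 1 - proximity.
dist = 1.0 - prox
np.fill_diagonal(dist, 0.0)
clusters = fcluster(linkage(squareform(dist), method="average"),
                    t=3, criterion="maxclust")

# Breiman-style idea: instances with low total proximity to the rest
# of the data are candidate outliers (self-proximity of 1 removed).
outlier_score = 1.0 / (prox.sum(axis=1) - 1.0 + 1e-12)
print(clusters[:10], outlier_score[:5])
```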
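Finally, a quick numerical check of the bootstrap probability used in the accuracy discussion:

```python
# Probability that a given instance is absent from a bootstrap sample
# of size n drawn with replacement: (1 - 1/n)^n -> e^(-1) ~ 0.368.
import math

for n in (10, 100, 1000, 10000):
    print(n, (1 - 1 / n) ** n)
print("limit:", math.exp(-1))
```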
