Feature Selection Using Random Forest

In random forests, the impurity decrease from each feature can be averaged across all the trees to determine the variable's final importance. For intuition: features selected near the top of the trees are in general more important than features selected at the end nodes, because the top splits lead to bigger information gains.

Let's walk through some Python code for selecting features with a random forest. I won't apply it to a particular dataset here, but the steps carry over to any real dataset.

1. Import the libraries:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
```

2. In all feature selection procedures, it is good practice to choose the features by examining only the training set; this avoids overfitting. So, given a feature matrix X and a target y, we split the data, select the features on the train set, and transfer the changes to the test set later (a sketch of that last step follows the walkthrough):

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

3. Here I do the model fitting and the feature selection together, in one line of code. First I specify the random forest instance, indicating the number of trees; then I use the SelectFromModel object from sklearn to select the features automatically:

```python
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(X_train, y_train)
```

By default, SelectFromModel keeps the features whose importance is greater than the mean importance of all the features, but we can alter this threshold if we want.

4. To see which features were kept, call the get_support method on the fitted selector:

```python
sel.get_support()
```

It returns an array of booleans: True for each feature whose importance is greater than the mean importance, False for the rest.

5. We can now list and count the selected features:

```python
selected_feat = X_train.columns[sel.get_support()]
len(selected_feat)
```

This returns an integer, the number of features the random forest selected.

6. To get the names of the selected features:

```python
print(selected_feat)
```

7. We can also check and plot the distribution of the importances:

```python
pd.Series(sel.estimator_.feature_importances_.ravel()).hist()
```

This draws a histogram of the importance values across all the features.

We can of course tune the parameters of the trees in the forest. Where we put the cut-off for selecting features is a bit arbitrary: one option is to keep the top 10 or 20 features; another is to keep the top 10th percentile.
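For those alternative cut-offs, SelectFromModel's max_features argument can keep a fixed number of top features, and a percentile cut-off can be computed from a fitted forest's importances. Here is a minimal sketch, assuming the imports from step 1 and the X_train and y_train from step 2; the values 20 and 90 are illustrative choices, not from the original:

```python
import numpy as np

# Top-k variant: threshold=-np.inf disables the mean-importance cut-off,
# so max_features alone decides how many features survive (k = 20 here).
sel_top20 = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    max_features=20,
    threshold=-np.inf,
)
sel_top20.fit(X_train, y_train)

# Percentile variant: fit a forest first, then keep the features whose
# importance falls in the top 10th percentile (>= the 90th percentile).
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
cutoff = np.percentile(rf.feature_importances_, 90)
sel_pct = SelectFromModel(rf, threshold=cutoff, prefit=True)
```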
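And the "transfer the changes to the test set" step promised in step 2 is simply the fitted selector's transform method, applied to both splits:

```python
# Apply the selection learned on the training set to both splits;
# transform returns arrays containing only the selected columns.
X_train_sel = sel.transform(X_train)
X_test_sel = sel.transform(X_test)
```

Because the selector was fitted on the training set only, the test set never influences which features are kept.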
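Finally, since no particular dataset is used above, here is a self-contained sketch that strings the steps together on synthetic data; the dataset shape, feature names, and random_state values are all illustrative assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a real dataset: 25 features, 5 of them informative.
X_arr, y = make_classification(n_samples=1000, n_features=25,
                               n_informative=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(25)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Fit the selector, then reduce both splits to the selected columns.
sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
sel.fit(X_train, y_train)
print(X_train.columns[sel.get_support()])  # names of the kept features
X_train_sel, X_test_sel = sel.transform(X_train), sel.transform(X_test)
```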
