Evaluating the effectiveness of feature selection algorithms for automating the feature selection process

And how can we use accumulated knowledge to improve the FS algorithm?Common feature selection methodsMany feature selection methods have been proposed and studied for machine learning applications.

They can be divided into four broad categories: the Embedded, Wrapper, Filter, and Hybrid approaches [1].

Filter methodsFilter methods are preprocessing methods, independent of the learning algorithms, with good generality.

They attempt to assess the merits of features from the data, ignoring the effects of the selected feature subset on the performance of the learning algorithm [1].

Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed to be improved.

For examples there are methods that are ranking the features according to: variance, correlation, univariate feature selection (selection based on univariate statistical tests such as: chi2 test, f classif test), ranking through compression techniques, like PCA or by computing correlation with the output (e.


Gram-Schmidt, mutual information).

Wrapper methodsWrapper methods are methods that assess subsets of variables according to their usefulness to a given predictor.

These methods uses the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets.

The two main disadvantages of these methods are :The increasing overfitting risk when the number of observations is insufficient.

The significant computation time when the number of variables is large.

For example there are stepwise methods proposed in linear regression analysis such as: AIC, BIC and MSE for Artificial Neural Network.

Embedded methodsThe embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and therefore may be more efficient than the first two categories.

Traditional machine learning algorithms like decision trees or artificial neural networks are examples of embedded approaches.

Hybrid methodsHybrid methods were proposed to combine the best properties of filters and wrappers.

First, a filter method is used to reduce the feature dimension space, that will be considered by the subsequent wrapper.

Then, a wrapper is employed to find the best candidate subset.

Hybrid methods usually achieve high accuracy that is characteristic to wrappers and high efficiency characteristic to filters [2].

In Tapreason we are using hybrid methods filters and embedded in various combinations to automatically find the feature selection method best for the problem at hand.

Methods of evaluating the effectiveness of the FSMFor the process of investigating the reaction of the prediction method to the feature selection method, we selected nine FSM and nine prediction methods (PM) and calculated the mean squared log error of the prediction.

In Table 1 you can see the mean squared log error for different PM with dependence of the FSM.

The color scales for each column separately from green (best) to red (worst).

According to table 1 the FSM using low variance is the worst FSM for most of the PM.

Naïve Bayes, according to the table 1 is the worst PM.

All nonzero FSM is a combination of all the features that were selected by at least one FSM.

All 5 nonzero FSM is a combination of all the features that were selected by at least five FSMs.

Table 1: feature selection method Vs prediction methodsThere are several ways to analyze the results shown on table 1 for evaluating the different Feature Selection Methods.

For example:Absolute minimum, best mean squared log error FSM/PM.

In table 1, the best results is Gradient Boosting (FSM) for Random Forest Regressor.

Ranking the FSM with the highest score in the column, chi2 test, LassoCV, All 5 nonzero features, all got the highest score twice.

Ranking the average of FM scores.

In table 1, it’s the All 5 nonzero features FSM.

Each method had selected a different FSM.

Thus, a fourth method was introduced:4.

Minimum ????, described in feature selection algorithm.

This method showed the most promising results.

In table 2, we show the ????.between the mean squared log error and the minimum mean squared log error of the PM.

Analyzing the resulting table using the average of the FSM rows X 1000, yields minimum at LSV : L1.

As you can see each column have one FSM with zero which is the minimum mean squared log error of the PM.

The FSM’s, LassoCV and all 5 nonzero, have three of the minimum mean squared log error of the PM (FSM.

Min) but none of them have the minimum average ????.

The FSM LSV : L1 have the minimum average ????.

Table 2: The FSM/PM table according ????.from the MinimumAccording to this analysis the Lasso Regression FSM is selected sinces it has the best mean squared log error for LSV : L1, out of the all the PMs.

The result of 0.

0384782765 is relatively close to the absolute minimum (Absolute.

Min) optimum of 0.


But further investigation uncovers PM with relatively worst results tend to skew the selection process towards the algorithms that performed best for them.

Thus, we’ve decided to counter this affect by adjusting the weight of the PMs according to their distance from the absolute minimum (Absolute.


The distance is calculated according to the following equation:1-PM.

MimPMAbsolute MinFSM.

Min (1)FSM.

Min — The minimum mean squared log error of the FSM.


Min — The minimum mean squared log error at the table FSM/PM.


Min — The minimum mean squared log error of the prediction method.

PM — The minimum mean squared log error.

Table 3: The FSM/PM table according to equation 1In table 3 we can see that for this example the final results does not chang.

But now FSMs with higher FSM.

Min have less influence (weight) on the average and there for, on the FSM selected.

This way we give a chance to all of the FSMs but not in away that can bias the results by large ????.from the Minimum.

The algorithmAt TapReason, we use dozens of different FSMs in search for the optimal algorithm per problem.

These FSMs include a wide variety of filter, embedded methods as well as ensembled FSMs that are constructed based on past performance.

The feature selection method is updated offline on a regular basis per client per problem.

The algorithm implemented is as follows:Calculate the mean squared log error (msle) for all FSM with all PMFor each PM select FSM/PM with minimum mean squared log error — FSM.

MinCalculate equation 1: 1-PM.

MimPMAbsolute MinFSM.

Min, for all FSM with all PMCalculate the average for each FSMSelect the FSM with the minimum average and set the FSM as the selected best FSMFuture researchWe know that certain prediction methods work better with certain feature selection methods.

Since in the presented process the FS method is selected first, the PM preference is not expressed in the process.

Thus, we can collect the information of the best FM for each PM use it after the PM is selected.









. More details

Leave a Reply