How to Perform Feature Selection with Categorical Data

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.

Feature selection is often straightforward when working with real-valued data, such as using the Pearson’s correlation coefficient, but can be challenging when working with categorical data.

The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g., classification predictive modeling) are the chi-squared statistic and the mutual information statistic.

In this tutorial, you will discover how to perform feature selection with categorical input data.

Let’s get started.

How to Perform Feature Selection with Categorical Data. Photo by Phil Dolby, some rights reserved.

This tutorial is divided into three parts: the breast cancer categorical dataset, categorical feature selection, and modeling with selected features.

As the basis of this tutorial, we will use the so-called “Breast cancer” dataset that has been widely studied as a machine learning dataset since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer.

There are 286 examples and nine input variables.

It is a binary classification problem.

A naive model can achieve an accuracy of 70% on this dataset.

A good score is about 76% +/- 3%.

We will aim for this region, but note that the models in this tutorial are not optimized; they are designed to demonstrate feature selection methods.

You can download the dataset and save the file as “breast-cancer.csv” in your current working directory.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

We can load this dataset into memory using the Pandas library.

Once loaded, we can split the columns into input (X) and output (y) for modeling.

Finally, we can force all fields in the input data to be strings, in case Pandas mapped some of them to numbers automatically (it does try).

We can tie all of this together into a helpful function that we can reuse later.
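A minimal sketch of such a loader is shown here, assuming the file has no header row and that the class label is the last column. The function name load_dataset() and the three inline rows are illustrative only; the rows are made up and stand in for breast-cancer.csv so that the snippet runs on its own.

```python
from io import StringIO

from pandas import read_csv


def load_dataset(source):
    """Load a headerless CSV and split it into string inputs X and labels y."""
    # header=None: the file is assumed to have no header row
    data = read_csv(source, header=None)
    dataset = data.values
    # assumption: the last column is the class label, all others are inputs
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # force every input field to a string in case pandas inferred numbers
    X = X.astype(str)
    return X, y


# made-up rows (three inputs plus a label), used instead of breast-cancer.csv
sample = StringIO(
    "'40-49','premeno',3,'no-recurrence-events'\n"
    "'50-59','ge40',1,'recurrence-events'\n"
    "'50-59','ge40',2,'no-recurrence-events'\n"
)
X, y = load_dataset(sample)
print(X.shape, y.shape)
print(X[0])
```

With the real file, the call would be load_dataset('breast-cancer.csv'). Note that the unquoted third column is parsed as an integer by Pandas and only becomes a string again through astype(str).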

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a learning model.

We will use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing.

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.
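The split described above can be sketched as follows. The arrays are placeholders that only share the dataset's dimensions (286 rows, nine inputs), and random_state=1 is an arbitrary seed chosen for repeatability, not a value taken from the tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder arrays with the dataset's dimensions: 286 rows, 9 inputs
X = np.full((286, 9), "a")
y = np.array(["no-recurrence-events", "recurrence-events"] * 143)

# test_size=0.33 holds out 33% of the rows; random_state=1 is an arbitrary seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)
print("Train", X_train.shape, y_train.shape)
print("Test", X_test.shape, y_test.shape)
```

The test set size is rounded up to a whole number of rows, which is why 33% of 286 gives 95 test rows and leaves 191 for training.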

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers.

This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise to you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
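As a starting point for that exercise, here is a small sketch of how an explicit order might be passed through the categories argument. The size bands are illustrative values for a single column; by default the encoder sorts categories as strings, which puts "5-9" after "15-19".

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# illustrative size bands for one column; object dtype holds plain Python strings
X = np.array([["0-4"], ["5-9"], ["10-14"], ["15-19"]], dtype=object)

# default: categories are sorted as strings, so "5-9" lands after "15-19"
default_codes = OrdinalEncoder().fit_transform(X).ravel()

# explicit: one list of categories per column, in the intended order
size_order = [["0-4", "5-9", "10-14", "15-19"]]
ordered_codes = OrdinalEncoder(categories=size_order).fit_transform(X).ravel()

print(default_codes)  # [0. 3. 1. 2.]
print(ordered_codes)  # [0. 1. 2. 3.]
```

For a dataset with several columns, the categories argument takes one list per column, in column order.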

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below named prepare_inputs() takes the input data for the train and test sets and encodes it using an ordinal encoding.
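A minimal sketch of prepare_inputs(), assuming the default OrdinalEncoder settings. The tiny arrays are made up to keep the snippet self-contained.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder


def prepare_inputs(X_train, X_test):
    """Ordinal-encode inputs, learning the categories from the training set only."""
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    # note: a category seen only in X_test raises an error with default settings
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc


# tiny made-up example with two categorical columns
X_train = np.array(
    [["30-39", "premeno"], ["40-49", "ge40"], ["50-59", "premeno"]], dtype=object
)
X_test = np.array([["40-49", "premeno"]], dtype=object)

X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
print(X_test_enc)  # [[1. 1.]]
```

Fitting on the training set alone means the test set never influences the mapping, at the cost of an error if the test set contains a category the encoder has not seen.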

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1.

This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose.

We could just as easily use the OrdinalEncoder and achieve the same result, although the LabelEncoder is designed for encoding a single variable.

The prepare_targets() function integer encodes the output data for the train and test sets.
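A sketch of what prepare_targets() might look like, using made-up label arrays. LabelEncoder assigns integers to the sorted class names, so "no-recurrence-events" becomes 0 and "recurrence-events" becomes 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def prepare_targets(y_train, y_test):
    """Integer-encode class labels, learning the mapping from the training set."""
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc


# made-up labels in the style of the dataset
y_train = np.array(["recurrence-events", "no-recurrence-events", "no-recurrence-events"])
y_test = np.array(["recurrence-events"])

y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
print(y_train_enc, y_test_enc)  # [1 0 0] [1]
```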

We can call these functions to prepare our data.

Tying this all together, the complete example of loading and encoding the input and output variables for the breast cancer categorical dataset is listed below.

Now that we have loaded and prepared the breast cancer dataset, we can explore feature selection.

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are the chi-squared statistic and the mutual information statistic.

Let’s take a closer look at each in turn.

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.
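To make the test concrete, the sketch below computes Pearson's chi-squared statistic by hand for a small contingency table and cross-checks it against SciPy. The counts are made up for illustration and do not come from the breast cancer dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# made-up contingency table: rows = feature category, columns = class label
observed = np.array([[30, 10],
                     [20, 40]])

# expected counts if the feature and the class were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson's statistic: sum of (observed - expected)^2 / expected over all cells
statistic = ((observed - expected) ** 2 / expected).sum()

# cross-check with SciPy; correction=False disables the Yates correction
result = chi2_contingency(observed, correction=False)
scipy_statistic, p_value = result[0], result[1]

print(round(statistic, 3), round(scipy_statistic, 3), p_value)
```

A large statistic (and a small p-value) suggests the feature and the class label are not independent, which is the property feature selection looks for.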

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.

The scikit-learn machine learning library provides an implementation of the chi-squared test in the chi2() function.

This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

For example, we can define the SelectKBest class to use the chi2() function and select all features, then transform the train and test sets.

We can then print the scores for each variable (largest is better), and plot the scores for each variable as a bar graph to get an idea of how many features we should select.
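The scoring step might look like the following sketch. It uses synthetic integer-coded data rather than the breast cancer dataset, with the label deliberately tied to feature 0 so that one feature stands out; the array sizes and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
# synthetic stand-in for ordinal-encoded inputs: 5 features with codes 0..3
X_train = rng.integers(0, 4, size=(200, 5)).astype(float)
X_test = rng.integers(0, 4, size=(100, 5)).astype(float)
# the label depends only on feature 0, so that feature should score highest
y_train = (X_train[:, 0] >= 2).astype(int)

# k="all" keeps every feature; at this point we only want the scores
fs = SelectKBest(score_func=chi2, k="all")
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)

# larger scores indicate a stronger relationship with the target
for i, score in enumerate(fs.scores_):
    print("Feature %d: %f" % (i, score))
```

The values in fs.scores_ can then be passed to a plotting call such as matplotlib's pyplot.bar() to produce the bar graph.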

Tying this together with the data preparation for the breast cancer dataset in the previous section, the complete example is listed below.

Running the example first prints the scores calculated for each input feature and the target variable.

Note: your specific results may differ.

Try running the example a few times.

In this case, we can see the scores are small and it is hard to get an idea from the number alone as to which features are more relevant.

Perhaps features 3, 4, 5, and 8 are most relevant.

A bar chart of the feature importance scores for each input feature is created.

This clearly shows that feature 3 might be the most relevant (according to chi-squared) and that perhaps four of the nine input features are the most relevant.

We could set k=4 when configuring the SelectKBest to select these top four features.

Bar Chart of the Input Features (x) vs The Chi-Squared Feature Importance (y)

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
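As a worked illustration, mutual information between two discrete variables can be computed directly from their joint distribution. The sketch below does this by hand for two short made-up arrays and compares the result with scikit-learn's mutual_info_score(), which reports the value in nats (natural logarithm).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# made-up encoded feature (3 categories) and class label (2 classes)
x = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])

# joint distribution p(x, y) from co-occurrence counts
joint = np.zeros((x.max() + 1, y.max() + 1))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()

# marginals p(x) as a column and p(y) as a row
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)

# I(X;Y) = sum of p(x,y) * log(p(x,y) / (p(x) p(y))), skipping empty cells
nonzero = joint > 0
mi = (joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])).sum()

print(mi, mutual_info_score(x, y))
```

A value of zero would mean that knowing the feature tells us nothing about the label; larger values mean a larger reduction in uncertainty.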


The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the mutual_info_classif() function.

Like chi2(), it can be used in the SelectKBest feature selection strategy (and other strategies).

We can perform feature selection using mutual information on the breast cancer set and print and plot the scores (larger is better) as we did in the previous section.
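A sketch of the same scoring step with mutual information, again on synthetic integer-coded data. Two choices here are assumptions rather than part of the tutorial: discrete_features=True tells mutual_info_classif() to treat every column as categorical, and random_state=1 fixes the seed. Both are bound with functools.partial so the function can still be handed to SelectKBest.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(1)
# synthetic stand-in for ordinal-encoded inputs: 5 features with codes 0..3
X_train = rng.integers(0, 4, size=(200, 5)).astype(float)
# the label depends only on feature 0, so that feature should score highest
y_train = (X_train[:, 0] >= 2).astype(int)

# assumption: treat all columns as discrete and fix the seed for repeatability
score_func = partial(mutual_info_classif, discrete_features=True, random_state=1)

fs = SelectKBest(score_func=score_func, k="all")
fs.fit(X_train, y_train)

for i, score in enumerate(fs.scores_):
    print("Feature %d: %f" % (i, score))
```

If mutual_info_classif() is passed to SelectKBest without these arguments, dense inputs are treated as continuous and scored with a nearest-neighbor estimator, which is one reason scores can vary between runs.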

The complete example of using mutual information for categorical feature selection is listed below.

Running the example first prints the scores calculated for each input feature and the target variable.

Note: your specific results may differ.

Try running the example a few times.

In this case, we can see that some of the features have a very low score, suggesting that perhaps they can be removed.

Perhaps features 3, 6, 2, and 5 are most relevant.

A bar chart of the feature importance scores for each input feature is created.

Importantly, a different mixture of features is promoted.

Bar Chart of the Input Features (x) vs The Mutual Information Feature Importance (y)

Now that we know how to perform feature selection on categorical data for a classification predictive modeling problem, we can try developing a model using the selected features and compare the results.

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

In this section, we will evaluate a Logistic Regression model with all features compared to a model built from features selected by chi-squared and those features selected via mutual information.

Logistic regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model.

As a first step, we will evaluate a LogisticRegression model using all the available features.

The model is fit on the training dataset and evaluated on the test dataset.
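The baseline can be sketched as follows. The data is synthetic (286 rows and nine integer-coded features, with the label tied to one feature plus some label noise), so the accuracy it prints says nothing about the breast cancer dataset; the seed and the noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# synthetic stand-in for the encoded dataset: 286 rows, 9 features coded 0..3
X = rng.integers(0, 4, size=(286, 9)).astype(float)
# the label follows feature 0, with about 10% of labels flipped as noise
y = (X[:, 0] >= 2).astype(int)
flip = rng.random(286) < 0.1
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# fit on the training set using all features
model = LogisticRegression(solver="lbfgs")
model.fit(X_train, y_train)

# evaluate on the held-out test set
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print("Accuracy: %.2f" % (accuracy * 100))
```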

The complete example is listed below.

Running the example prints the accuracy of the model on the test dataset.

Note: your specific results may vary given the stochastic nature of the learning algorithm.

Try running the example a few times.

In this case, we can see that the model achieves a classification accuracy of about 75%.

We would prefer to use a subset of features that achieves a classification accuracy that is as good or better than this.

We can use the chi-squared test to score the features and select the four most relevant features.

The select_features() function below is updated to achieve this.
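One way select_features() might be written is sketched here. The signature, including the k parameter with a default of 4, is an assumption, and the data is synthetic with the label tied to feature 3.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2


def select_features(X_train, y_train, X_test, k=4):
    """Keep the k features with the highest chi-squared scores."""
    fs = SelectKBest(score_func=chi2, k=k)
    # learn the scores from the training data only
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs


rng = np.random.default_rng(1)
# synthetic stand-in for the encoded train and test inputs
X_train = rng.integers(0, 4, size=(191, 9)).astype(float)
X_test = rng.integers(0, 4, size=(95, 9)).astype(float)
y_train = (X_train[:, 3] >= 2).astype(int)

X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
print(X_train_fs.shape, X_test_fs.shape)  # (191, 4) (95, 4)
print(fs.get_support(indices=True))
```

The get_support(indices=True) call reports which column indices were kept, which is useful when comparing against the scores and bar chart from the earlier section.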

The complete example of evaluating a logistic regression model fit and evaluated on data using this feature selection method is listed below.

Running the example reports the performance of the model on just four of the nine input features selected using the chi-squared statistic.

Note: your specific results may vary given the stochastic nature of the learning algorithm.

Try running the example a few times.

In this case, we see that the model achieved an accuracy of about 74%, a slight drop in performance.

It is possible that some of the features removed are, in fact, adding value directly or in concert with the selected features.

At this stage, we would probably prefer to use all of the input features.

We can repeat the experiment and select the top four features using a mutual information statistic.

The updated version of the select_features() function to achieve this is listed below.

The complete example of using mutual information for feature selection to fit a logistic regression model is listed below.

Running the example fits the model on the four top selected features chosen using mutual information.

Note: your specific results may vary given the stochastic nature of the learning algorithm.

Try running the example a few times.

In this case, we can see a small lift in classification accuracy to 76%.

To be sure that the effect is real, it would be a good idea to repeat each experiment multiple times and compare the mean performance.

It may also be a good idea to explore using k-fold cross-validation instead of a simple train/test split.
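A sketch of that more robust comparison is given below, under several assumptions: the data is synthetic, the inputs are already integer-coded, and 10 folds with 3 repeats is an arbitrary configuration. Placing the feature selection step inside a Pipeline means it is refit on the training portion of every fold.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
# synthetic stand-in for the encoded dataset: 286 rows, 9 features coded 0..3
X = rng.integers(0, 4, size=(286, 9)).astype(float)
y = (X[:, 0] >= 2).astype(int)
flip = rng.random(286) < 0.1
y[flip] = 1 - y[flip]

# assumption: treat columns as discrete and fix the seed for mutual information
mi = partial(mutual_info_classif, discrete_features=True, random_state=1)

# 10-fold cross-validation repeated 3 times gives 30 accuracy scores per method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

results = {}
for name, score_func in [("chi2", chi2), ("mutual_info", mi)]:
    pipeline = Pipeline([
        ("select", SelectKBest(score_func=score_func, k=4)),
        ("model", LogisticRegression(solver="lbfgs")),
    ])
    scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
    results[name] = scores
    print("%s: %.3f (%.3f)" % (name, scores.mean(), scores.std()))
```

With the real dataset, the encoding step would ideally sit inside the pipeline as well, so that it too is fit only on each training fold.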


In this tutorial, you discovered how to perform feature selection with categorical input data.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.
