Feature Engineering and Selection (Book Review)

Data preparation is the process of transforming raw data into a form suitable for learning algorithms.

In some cases, data preparation is a required step in order to provide the data to an algorithm in its required input format.

In other cases, the most appropriate representation of the input data is not known and must be explored in a trial-and-error manner in order to discover what works best for a given model and dataset.

Max Kuhn and Kjell Johnson have written a new book focused on this important topic of data preparation and how to get the most out of your data on a predictive modeling project with machine learning algorithms.

The title of the book is “Feature Engineering and Selection: A Practical Approach for Predictive Models” and it was released in 2019.

In this post, you will discover my review and breakdown of the book “Feature Engineering and Selection” on the topic of data preparation for machine learning.

Let’s dive in!

“Feature Engineering and Selection: A Practical Approach for Predictive Models” is a book written by Max Kuhn and Kjell Johnson and published in 2019.

Kuhn and Johnson are the authors of one of my favorite books on practical machine learning titled “Applied Predictive Modeling,” published in 2013.

And Kuhn is also the author of the popular caret R package for machine learning.

As such, any book they publish, I will immediately buy and devour.

This new book is focused on the problem of data preparation for machine learning.

The authors highlight that although fitting and evaluating models is routine, achieving good performance for a predictive modeling problem is highly dependent upon how the data is prepared.

Despite our attempts to follow these good practices, we are sometimes frustrated to find that the best models have less-than-anticipated, less-than-useful predictive performance.

This lack of performance may be due to […] relevant predictors that were collected are represented in a way that models have trouble achieving good performance.

— Page xi, “Feature Engineering and Selection,” 2019.

They refer to the process of preparing data for modeling as “feature engineering.”

This is a slightly different definition than I am used to.

I would call it “data preparation” or “data preprocessing” and hold “feature engineering” apart as a subtask focused on systematic steps for creating new input variables from existing data.

Nevertheless, I see where they are coming from, as all data preparation could fit that definition.

Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering.

— Page xi, “Feature Engineering and Selection,” 2019.

They motivate the book by pointing out that we cannot know the most appropriate data representation to use in order to achieve the best predictive modeling performance, and that we may need to systematically test a suite of representations in order to discover what works best.

This matches the empirical approach that I recommend in general. It is rarely discussed, so it is comforting to see it in a textbook.

… we often do not know the best re-representation of the predictors to improve model performance.

[…] we may need to search many alternative predictor representations to improve model performance.

— Page xii, “Feature Engineering and Selection,” 2019.

Given the importance of data preparation in order to achieve good performance on a dataset, the book is focused on highlighting specific data preparation techniques and how to use them.

The goals of Feature Engineering and Selection are to provide tools for re-representing predictors, to place these tools in the context of a good predictive modeling framework, and to convey our experience of utilizing these tools in practice.

— Page xii, “Feature Engineering and Selection,” 2019.

Like their previous book, all worked examples are in R, and in this case, the source code is available from the book’s GitHub project.

Also, unlike the previous book, the complete contents of the book are available for free online.

Next, let’s take a closer look at the topics covered by the book.

The book is divided into 12 chapters. Let’s take a closer look at each chapter.

The introductory chapter provides a good overview of the challenge of predictive modeling.

It starts by highlighting the important distinction between descriptive and predictive models.

… the prediction of a particular value (such as arrival time) reflects an estimation problem where our goal is not necessarily to understand if a trend or fact is genuine but is focused on having the most accurate determination of that value.

The uncertainty in the prediction is another important quantity, especially to gauge the trustworthiness of the value generated by the model.

— Page 1, “Feature Engineering and Selection,” 2019.

Importantly, the chapter emphasizes the need for data preparation in order to get the most out of a predictive model on a project.

The idea that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of feature engineering—the process of creating representations of data that increase the effectiveness of a model.

— Page 3, “Feature Engineering and Selection,” 2019.

As an introduction, a number of foundational topics are covered that you should probably already be familiar with.

I really like that they point out the iterative nature of the predictive modeling process.

It is not a single pass through the data, as is often described elsewhere.

When modeling data, there is almost never a single model fit or feature set that will immediately solve the problem.

The process is more likely to be a campaign of trial and error to achieve the best results.

— Page 16, “Feature Engineering and Selection,” 2019.

I also really like that they hammer home just how much the chosen representation of the input data impacts the performance of a model, and that we just cannot guess how well a given representation will allow a model to perform.

The effect of feature sets can be much larger than the effect of different models.

The interplay between models and features is complex and somewhat unpredictable.

— Page 16, “Feature Engineering and Selection,” 2019.

As the name suggests, this chapter aims to make the process of predictive modeling concrete with a worked example.

As a primer to feature engineering, an abbreviated example is presented with a modeling process […] For the purpose of illustration, this example will focus on exploration, analysis fit, and feature engineering, through the lens of a single model (logistic regression).

— Page 21, “Feature Engineering and Selection,” 2019.

This chapter reviews the process of predictive modeling with a focus on how and where data preparation fits into the process.

It covers a number of process concerns.

The important takeaway from this chapter is that the application of data preparation in the process is critical, as a misapplication can result in data leakage and overfitting.

In order for any resampling scheme to produce performance estimates that generalize to new data, it must contain all of the steps in the modeling process that could significantly affect the model’s effectiveness.

— Pages 54-55, “Feature Engineering and Selection,” 2019.

The solution is to fit data preparation on the training dataset only, then apply the fit transforms on the test set and other datasets as needed.

This is a best practice in predictive modeling when using train/test splits and k-fold cross-validation.

To provide a solid methodology, we should constrain ourselves to developing the list of preprocessing techniques, estimate them only in the presence of the training data points, and then apply the techniques to future data (including the test set).

— Page 55, “Feature Engineering and Selection,” 2019.
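In scikit-learn terms (my sketch, not the book’s R code), this means calling fit on the training split only, then applying the fitted transform to both splits; the scaler here is an arbitrary example of a data preparation step:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset.
rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=1)

# Estimate the scaling parameters on the training data only...
scaler = StandardScaler()
scaler.fit(X_train)

# ...then apply the same fitted transform to both splits.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# The test set is scaled with the training statistics, avoiding data leakage.
print(X_train_s.mean(axis=0).round(6))  # approximately zero on every column
```

The same idea extends to k-fold cross-validation: each fold’s preparation is fit only on that fold’s training portion, which is exactly what a scikit-learn Pipeline automates.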

This chapter focuses on an important step to perform prior to data preparation, namely, taking a close look at the data.

The authors suggest using data visualization techniques to first understand the target variable that is being predicted, then to focus on the input variables.

This information can then be used to inform the types of data preparation methods to explore.

One of the first steps of the exploratory data process when the ultimate purpose is to predict a response, is to create visualizations that help elucidate knowledge of the response and then to uncover relationships between the predictors and the response.

— Page 65, “Feature Engineering and Selection,” 2019.

This chapter focuses on alternate representations for categorical variables that summarize qualitative information.

Categorical or nominal predictors are those that contain qualitative data.

— Page 93, “Feature Engineering and Selection,” 2019.

Categorical variables may have a rank-order relationship (ordinal) or have no such relationship (nominal).

Simple categorical variables can also be classified as ordered or unordered.

[…] Ordered and unordered factors might require different approaches for including the embedded information in a model.

— Page 93, “Feature Engineering and Selection,” 2019.

This includes techniques such as dummy variables, hashing, and embeddings.
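To make two of those concrete, here is a minimal Python sketch (my example, not the book’s R code) of dummy variables and feature hashing on a made-up nominal predictor:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# A nominal predictor with no rank-order relationship.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Dummy (one-hot) variables: one binary column per category.
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies.columns.tolist())  # ['color_blue', 'color_green', 'color_red']

# Feature hashing: map categories into a fixed number of columns,
# useful when the number of categories is large or unknown.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([["red"], ["green"], ["blue"], ["green"]])
print(hashed.shape)  # (4, 8)
```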

This chapter focuses on alternate representations for numerical variables that summarize quantitative information.

The objective of this chapter is to develop tools for converting these types of predictors into a form that a model can better utilize.

— Page 121, “Feature Engineering and Selection,” 2019.

There are many well-understood problems that we may observe with numerical variables.

Interestingly, the authors present a suite of techniques organized by the effect each method has on the input variable; that is, whether the method operates on one input variable or many, and whether it produces a single result or many results.

This includes a host of methods, such as data scaling, power transforms, and projection methods.
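As a Python illustration of that one-to-one versus many-to-many organization (my sketch on synthetic skewed data, not the book’s R code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.decomposition import PCA

# Skewed synthetic inputs standing in for real numeric predictors.
rng = np.random.default_rng(7)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 5))

# One-to-one: scaling maps each variable to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# One-to-one: a power transform (Yeo-Johnson) reduces skew.
X_power = PowerTransformer(method="yeo-johnson").fit_transform(X)

# Many-to-many: a projection method maps 5 inputs to 2 components.
X_proj = PCA(n_components=2).fit_transform(X)

print(X_scaled.min(), X_power.shape, X_proj.shape)
```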

This chapter focuses on a topic that is often overlooked, which is the study of how variables interact in a dataset.

Technically, interaction refers to variables that, taken together, have a greater or lesser effect than would be expected from each variable considered in isolation.

For many problems, additional variation in the response can be explained by the effect of two or more predictors working in conjunction with each other.

[…] More formally, two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add the impact of each of their effects when considered alone.

— Page 157, “Feature Engineering and Selection,” 2019.

This topic is often overlooked in the context of data preparation, as it is often believed that the learning algorithms used in predictive modeling will learn any relevant interrelationships between the variables that assist in predicting the target variable.
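One common way to hand interactions to a model explicitly is to add product terms; a minimal Python sketch (my example, not from the book):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two predictors whose product may explain extra variation in the response.
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# interaction_only=True adds products of predictors, not squared terms.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)
print(X_inter)  # columns: x1, x2, x1*x2 -> [[2. 3. 6.], [4. 5. 20.]]
```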

This chapter focuses on the problem of missing observations in the available data.

It is an important topic because most data has missing or corrupt values, or it will if the dataset is scaled up.

Missing data are not rare in real data sets.

— Page 157, “Feature Engineering and Selection,” 2019.
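Imputation is one widely used remedy for missing values; a minimal sketch with scikit-learn (my example on a tiny made-up array, not the book’s R code):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A small array with missing values marked as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Mean imputation: replace each NaN with its column's mean.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)  # NaNs replaced by the column means 4.0 and 3.0
```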

After reviewing causes for missing data and data visualizations that help to understand the scope of missing values in a dataset, the chapter works through three main solutions.

This chapter provides a case study of data preparation methods for profile data.

It might be poorly named, but it has to do with data with dependencies at different scales, e.g. how to do useful data preparation with data at day/week/month scope on a given dataset (e.g. hierarchical structures).

Since the goal is to make daily predictions, the profile of within-day weather measurements should be somehow summarized at the day level in a manner that preserves the potential predictive information.

For this example, daily features could include the mean or median of the numeric data and perhaps the range of values within a day.

— Page 205, “Feature Engineering and Selection,” 2019.

I found it entirely uninteresting, I’m afraid.

But I’m sure it would be the most interesting chapter to anyone currently working with this type of data.

This chapter motivates the need for feature selection: selecting the input variables that are most relevant to the target variable being predicted.

… some may not be relevant to the outcome.

[…] there is a genuine need to appropriately select predictors for modeling.

— Page 227, “Feature Engineering and Selection,” 2019.

In addition to lifting model performance, selecting fewer input variables can make the model more interpretable, although often at the cost of model skill.

This is a common trade-off seen in predictive modeling.

… there is often a trade-off between predictive performance and interpretability, and it is generally not possible to maximize both at the same time.

— Page 227, “Feature Engineering and Selection,” 2019.

A framework of three classes is used to organize feature selection methods.

Feature selection methodologies fall into three general classes: intrinsic (or implicit) methods, filter methods, and wrapper methods.

— Page 228, “Feature Engineering and Selection,” 2019.
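The three classes map neatly onto scikit-learn; a compact Python sketch of one representative of each (my choices, not the book’s R examples):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data with a few informative features.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=1)

# Filter: score each feature against the target, keep the best k.
X_filter = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Wrapper: repeatedly fit a model and eliminate the weakest features.
X_wrap = RFE(LogisticRegression(max_iter=1000),
             n_features_to_select=3).fit_transform(X, y)

# Intrinsic: an L1-penalized model performs selection while fitting.
intrinsic = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_kept = int(np.sum(intrinsic.coef_ != 0))

print(X_filter.shape, X_wrap.shape, n_kept)
```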

The remaining two chapters also focus on feature selection.

This chapter focuses on methods that evaluate features one at a time and then select subsets of features that score well.

This includes methods that calculate the strength of a statistical relationship between the input and the target, and methods that iteratively delete features from the dataset and evaluate a model at each step.

A simple approach to identifying potentially predictively important features is to evaluate each feature individually.

[…] Simple filters are ideal for finding individual predictors.

However, this approach does not take into account the impact of multiple features together.

— Page 255, “Feature Engineering and Selection,” 2019.

This chapter focuses on global search algorithms that test different subsets of features based on the performance of the models fit on those features.

Global search methods can be an effective tool for investigating the predictor space and identifying subsets of predictors that are optimally related to the response.

[…] Although the global search approaches are usually effective at finding good feature sets, they are computationally taxing.

— Page 281, “Feature Engineering and Selection,” 2019.

This includes well-known global stochastic search algorithms such as simulated annealing and genetic algorithms.
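The book’s examples are in R; as a rough Python sketch of the underlying idea (a naive random search rather than simulated annealing or a genetic algorithm), we can score random feature subsets by cross-validation and keep the best:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; only some of the 12 features are informative.
X, y = make_classification(n_samples=200, n_features=12,
                           n_informative=4, random_state=3)
rng = np.random.default_rng(3)

best_score, best_mask = -np.inf, None
for _ in range(20):  # each candidate subset costs a full cross-validation
    mask = rng.random(X.shape[1]) < 0.5  # pick a random subset of features
    if not mask.any():
        continue
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, mask], y, cv=5).mean()
    if score > best_score:
        best_score, best_mask = score, mask

print(best_mask.sum(), round(best_score, 3))
```

This illustrates why global search is computationally taxing: every candidate subset requires fitting and evaluating a model, which is exactly the cost that smarter search strategies like simulated annealing try to spend more efficiently.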

I think this is the much-needed, missing textbook on data preparation.

I also think that if you are a serious machine learning practitioner, you need a copy.

If you are familiar with both R and Python for machine learning, the book highlights just how far libraries like Python/scikit-learn have to go to catch up to the R/caret ecosystem.

When it comes to data preparation, I don’t think worked examples are as useful as they are when demonstrating algorithms.

Perhaps it is just me and my preference.

Given how different each dataset is in terms of number, type, and composition of features, demonstrating data preparation on standard datasets is not a helpful teaching aid.

What I would prefer is a more systematic coverage of the problems we may see in raw data when it comes to modeling and how each data preparation method addresses it.



I’d love a long catalog of methods, how they work, and when to use them rather than prose about each method.

Anyway, that is just me pushing hard on how the book could be made better, and offering an alternate vision of the material.

It’s a must-have, no doubt.

This section provides more resources on the topic if you are looking to go deeper.

In this post, you discovered my review and breakdown of the book Feature Engineering and Selection on the topic of data preparation for machine learning.

Have you read the book? Let me know what you think of it in the comments below.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.

