Feature engineering

Feature engineering is the process of transforming raw, unprocessed data into a set of targeted features that best represent your underlying machine learning problem.

Engineering thoughtful, optimized data is the vital first step.

In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

This is often one of the most valuable tasks a data scientist can do to improve model performance, for 3 big reasons:

1. You can isolate and highlight key information, which helps your algorithms “focus” on what’s important.

2. You can bring in your own domain expertise.

3. Most importantly, once you understand the “vocabulary” of feature engineering, you can bring in other people’s domain expertise!

Before moving on, we just want to note that this is not an exhaustive compendium of all feature engineering, because there are limitless possibilities for this step.

The good news is that this skill will naturally improve as you gain more experience.

Garbage in, garbage out.

I’m sure you’ve heard the phrase before.

It can apply to relationships, dieting, working out, job performance, you name it: in order to get the best results, you have to fully commit to the best practices.

Sure, it may sound simplistic, but it’s also true for machine learning projects.

The quality of your model’s predictive output will only be as good as the quality and focus of the data it receives.

The process of transforming raw, unprocessed data into a set of targeted features (or variables) that accurately represent your machine learning problem is called feature engineering.

At its most basic, the process entails answering four key questions:

1. What are the essential properties of the problem we’re trying to solve?

2. How do those properties interact with each other?

3. How will those properties interact with the inherent strengths and limitations of our model?

4. How can we augment our dataset so as to enhance the predictive performance of the AI?

Though the exact steps involved in answering these questions differ for each machine learning project, here are 5 of the best practices to ensure you’re doing all you can to optimize your data management process.

1. Utilize Domain Expertise and Individual Creativity to Determine Variables

The cornerstone of good Design Thinking also happens to be the cornerstone of good feature engineering: utilizing individual creativity and domain expertise in order to identify the important variables within your problem.

Feature Engineering is as much an art as a science.

Before even thinking about the models or algorithms or predictions, a team of domain experts and technologists must evaluate all the available variables and determine which of those variables will actually add value to your algorithm and which may result in noise or overfitting.

2. Use Indicator Variables to Isolate Important Information

Most machine learning algorithms can’t directly address categorical features, so you need to create indicator variables to represent the independent options within a category.

For example, if you’re a rideshare startup studying transportation usage in a particular region, it makes sense to have a preferred mode of transportation feature.

Within that feature, you could create indicator variables to distinguish subjects who prefer driving, biking, walking, taking the train, etc.

Indicator variables are set to numerical values so that algebraic algorithms can optimally process these features.
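As a minimal sketch of this step in pandas (the riders frame and the preferred_mode column are made up purely for illustration, not taken from any dataset in this article):

import pandas as pd

# hypothetical rideshare survey data with one categorical feature
riders = pd.DataFrame({
    'customer': [1, 2, 3, 4],
    'preferred_mode': ['drive', 'bike', 'walk', 'train']
})

# one 0/1 indicator column per category: preferred_mode_drive, preferred_mode_bike, ...
riders = pd.get_dummies(riders, columns=['preferred_mode'])
riders.head()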

3. Create Interaction Features to Highlight Variable Relationships

The next step in feature engineering is highlighting relevant interactions between two or more features.

It’s important, when looking for opportunities, to take not only the sum of variables but also the product, difference, or quotient of those variables.

For example, going back to our transportation example, if you wanted to capture the interaction between travel frequency and mode of travel, you could create interaction features to highlight each of those intersecting data points.
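As a rough sketch of what such interaction features might look like in pandas (the riders frame and its column names are hypothetical, invented for illustration):

import pandas as pd

# hypothetical rider data
riders = pd.DataFrame({
    'trips_per_week': [3, 10, 1],
    'total_fare': [21.0, 95.0, 12.5],
    'prefers_bike': [1, 0, 0]
})

# product, quotient and difference interactions between existing features
riders['bike_trip_volume'] = riders['trips_per_week'] * riders['prefers_bike']
riders['fare_per_trip'] = riders['total_fare'] / riders['trips_per_week']
riders['fare_vs_average'] = riders['total_fare'] - riders['total_fare'].mean()
riders.head()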

This step requires experimentation and an openness to new relationships and correlations.

You do not want to limit relationships based on preconceived assumptions.

Part of the fun of using machine learning to analyze your data is discovering new relationships and opportunities.

4. Combine or Remove Sparse Classes to Avoid Modeling Errors

Sparse classes are categories that have only a few data points.

These can be harmful for your machine learning algorithms, as they may cause a modeling error called overfitting.

If you combine sparse classes into one class (for example, an “other” category), or remove them completely, this will unclutter your data and improve your AI’s ability to generalize its predictions.

This ensures that your AI is not skewing your results based on a few data points that are not relevant to new data.
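A minimal sketch of that idea in pandas (the modes series and the cutoff of 5 observations are made up for illustration):

import pandas as pd

# hypothetical categorical column with a couple of sparse classes
modes = pd.Series(['drive'] * 50 + ['bike'] * 30 + ['unicycle'] * 2 + ['scooter'])

# lump any class seen fewer than 5 times into a single 'other' class
counts = modes.value_counts()
sparse = counts[counts < 5].index
modes = modes.where(~modes.isin(sparse), 'other')
modes.value_counts()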

5. Remove Irrelevant/Redundant Features

Finally, it’s useful to remove irrelevant or redundant features from your dataset.

Again, feature engineering is all about pre-processing data so your model will spend the minimum possible effort wading through the noise.

Removing irrelevant or redundant data points will help unclog the gears of your AI’s engine.
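For instance, a hedged sketch in pandas (the columns here are invented; in practice you would use domain knowledge or correlation checks to decide what is redundant):

import pandas as pd

# hypothetical frame: weight_kg and weight_lb carry the same information,
# and favourite_colour is probably irrelevant to the prediction target
df = pd.DataFrame({
    'weight_kg': [70, 80, 60],
    'weight_lb': [154.3, 176.4, 132.3],
    'favourite_colour': ['red', 'blue', 'green'],
    'target': [1, 0, 1]
})

# drop the redundant duplicate measurement and the irrelevant column
df = df.drop(['weight_lb', 'favourite_colour'], axis=1)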

In Summary

If the features of your data don’t accurately represent the predictive signals of your problem, there’s no amount of hyperparameter tuning or algorithmic tinkering that will salvage your model’s predictive ability.

Engineering thoughtful, optimized data is a vital first step to engineering thoughtful, optimized predictions.

Example in Python

Data

To illustrate what is possible, we will consider a simple transaction data set, one possibly generated from retail purchases.

Let’s say that we have a simple transaction table, with a column identifying the customer, a column indicating the product that was purchased, a column for the price, and a column containing the date and time the purchase was made.

Let us say this data is available in a CSV.

You could obtain such a data set from Kaggle’s Acquire Valued Shopper Challenge.

Look for the transactions data.

Note that in this data set, there is no price.

For this example, we will be using the Ta-Feng data set.

There are a number of other data sets for grocery/retail in Recsys.

Round 1: Basic Features

When we look at a date time stamp, a number of features, or pieces of information, are immediately obvious:

- Year
- Month
- Day
- Day of week
- Week of year
- Hour of day

Month and day of the week can be quite useful in understanding the periodicity or seasonality of transactions.

We may find that some actions are more probable on certain days of the week, or something happens around the same month every year.

With Halloween around the corner, for example, you are probably shopping for candy right now.

Using pandas, we try and load this data set (you may have to remove the header row from the file):

import pandas as pd

columns = ['date', 'customer', 'age', 'zipcode', 'product_class', 'product_id', 'amount', 'asset', 'price']
txs = pd.read_table('D11-02/D01', sep=';', header=None, names=columns)
txs.info() # to get summary statistics
txs.head() # to get a feel for the data

Unfortunately, the timestamps in this dataset are useless.

I couldn’t find a realistic data set which has time stamp information.

Welcome to the real world, with imperfect data! However, if you know of a good data set, I would love to hear from you!

For the purpose of our feature engineering, let us just imagine that timestamps are available.

Now, let us start adding our first set of features to this data set.

from datetime import datetime

year = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").year
txs['year'] = txs['date'].map(year)
txs.head()

You can see here that the feature was added to the DataFrame.

Here are some other map functions you could use:

day_of_week = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").weekday()
month = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").month
# please read the docs on how week numbers are calculated
week_number = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").strftime('%V')

You can try writing some of the other features we mentioned above yourself.
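For example, here are a couple more maps in the same style, along with how the ones above can be applied (a sketch that assumes the earlier snippets have been run and that your date strings follow the same format):

day = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").day
hour = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").hour

txs['month'] = txs['date'].map(month)
txs['day_of_week'] = txs['date'].map(day_of_week)
txs['week_of_year'] = txs['date'].map(week_number)
txs['day'] = txs['date'].map(day)
txs['hour'] = txs['date'].map(hour)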

See, with such simple code, we just added 7 new features!

Round 2: More Interesting Features

Now, let's think of more interesting features that may involve lookups.

How about seasons, or times of day? Here are some example maps that you could run:

seasons = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0] # Dec-Feb is winter, then spring, summer, fall, etc.
season = lambda x: seasons[datetime.strptime(x, "%Y-%m-%d %H:%M:%S").month - 1]

# sleep: 12-5, 6-9: breakfast, 10-14: lunch, 14-17: dinner prep, 17-21: dinner, 21-23: desserts!
times_of_day = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5] # one entry per hour 0-23
time_of_day = lambda x: times_of_day[datetime.strptime(x, "%Y-%m-%d %H:%M:%S").hour]

We used this time of day map to understand when people look for breakfast recipes, kids' lunch box recipes, appetizers, etc.

Intuitively, you can imagine that people prepare for the next day’s lunch and breakfast around or after dinner, especially if you have children.

If you are coming to the end of your workday, you are probably thinking about dinner and what you could pick up on your way home.

This feature was extracted from clickstream data to enrich the data set and give additional insight into what types of recipes to show at what time.

The season was also a good predictor to understand which recipes are timeless and which are more seasonal.

For another grocery client, we saw a huge uptick in browsing flyers between 8 and 10 am and noon to 1 pm (lunch hour) during weekdays.

Further, Wednesdays and Thursdays were the heaviest traffic days of the week, as most people plan their grocery shopping just before the weekend.

Distance between Holidays

Retail is a very seasonal business.

Often, people are buying for an occasion or near an occasion.

Intuitively, people may be purchasing for Valentine’s day, Thanksgiving due to all the great sales, Christmas, etc.

To understand which customers are more driven by these special occurrences, a set of features needs to be created that measures the distance to each of these occurrences.

The following steps are required to make this work:

1. Pull a list of holidays/occurrences from a data source/API for a given geography.

2. Create a pandas DataFrame of these.

3. Create new columns in the transaction data frame that compute the distance between the transaction date and each holiday date.

Pull a list of holidays and create a DataFrame

There are a number of public sources, like Wikipedia, or sites that provide an API, like timeanddate.com.

For the purpose of this article, let's assume that all transactions are from the U.S.

Further, instead of using an API, let's scrape this data from here.

The code will use Beautiful Soup to extract the data and create a list of dictionary objects that can be loaded into a pandas DataFrame.

You could also create a CSV and load the CSV into a data frame.

Using this CSV approach lets you build a service that runs periodically to create this CSV and doesn’t slow down the actual processing/feature creation.

You don’t want to hit the API every time you need holiday information as it is essentially static.
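A minimal sketch of that caching pattern (the file name holidays_us.csv and the build_holidays_frame helper are hypothetical placeholders for the scraping code below):

import os
import pandas as pd

HOLIDAY_CSV = 'holidays_us.csv' # arbitrary cache file name

if os.path.exists(HOLIDAY_CSV):
    holidays_frame = pd.read_csv(HOLIDAY_CSV) # reuse the cached copy
else:
    holidays_frame = build_holidays_frame() # hypothetical function wrapping the scraper below
    holidays_frame.to_csv(HOLIDAY_CSV, index=False) # refresh the cache for next time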

For the sake of simplicity, the list of dictionaries is loaded straight into a data frame.

from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.timeanddate.com/holidays/us/2015?hol=16#!hol=49")
holidays = BeautifulSoup(page.read())
print holidays.title
# Holidays and observances in United States in 2015

At this point, a representation of the page has been loaded into memory.

The structure of the page needs to be looked at to determine the right element to target.

Fortunately, the page exposes the entire list of holidays in a table.

The column titled Holiday Type will be used to filter the values.

For the purpose of this article, rows with the following types will be used:

- National Holiday
- Observance

Now, to get to the data in the table, use Chrome to highlight the table element, right click to Inspect element, and then right click on the element in the HTML code and select Copy CSS Path to get the reference of the table:

import string # for translation

table = string.maketrans("", "") # for removing punctuation etc

rows = holidays.select("body > div > div.main-content-div > div.fixed > table > tbody > tr")
holidays_list = []
for row in rows:
    cols = list(row)
    day = cols[0].string # first col is the day
    holiday_type = cols[3].string
    name = cols[2].string
    if holiday_type is not None:
        if "national holiday" in holiday_type.lower():
            print day, name # purely to debug
            holiday = {}
            holiday['name'] = str(name).translate(table, string.punctuation + " ")
            holiday['day'] = str(day) + ", 2001" # since all transactions are from 2001
            holidays_list.append(holiday)

# now convert to a data frame
holidays_frame = pd.DataFrame(holidays_list)

To keep things simple, only national holidays were selected.

This list could be expanded to other types of holidays — this is left as an exercise for the reader.

Create distance features

The logic for creating these features is to take each row in the transaction table, compare the date of the transaction to every row in the holidays_frame created above, and compute the number of days ahead or behind that particular holiday.

new_frame = pd.DataFrame(holidays_list, index=holidays_frame['name']) # to help with the index selection

def compute_date_diff(x, y):
    # convert x into a date, y into a date, compute the date diff
    date_x = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    date_y = datetime.strptime(y, "%b %d, %Y")
    return (date_y - date_x).days

for holiday in list(new_frame.index):
    day = new_frame.loc[holiday, 'day']
    print day
    txs[holiday] = txs['date'].apply(compute_date_diff, args=(day,))

txs[['date'] + list(holidays_frame['name'])].head()

It is easy to see how weather information from public APIs could also be added in, if location information were available through customer addresses or store information.

If you were trying to build a regression model to predict the amount (dollar value) of purchases given a customer, product and date, you could train the model on 15 additional features that were just created!

Computing preferences for customers

Another interesting thing that could be done just with this data is to see which customers have a preference for a certain season, or for shopping around Valentine’s Day or Christmas.

cust_xmas = txs.groupby('customer')['ChristmasDay'].mean()
cust_xmas.order() # in newer pandas, use sort_values()

The data set used was not rich enough to have a wider variety of dates for transactions, but in a more real-world scenario, you would see how this would play out.

Feel free to try with product class, or products.

You can also combine columns like the customer-product class to see if there is a specific preference for a customer for a given product class.
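As a sketch of that combination (assuming the txs frame with the holiday distance columns created above):

# average distance-to-Christmas per (customer, product_class) pair
cust_class_xmas = txs.groupby(['customer', 'product_class'])['ChristmasDay'].mean()
cust_class_xmas.head()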

Conclusion

In this article, we converted a simple datetime column into over 15 columns! There is more information in that column that has not been teased out.

For example, days between purchases per customer could be created.

Then, this difference could be subtracted from a global average of days between purchases to determine if a customer purchases more often.
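A hedged sketch of that idea (assuming txs from above, with dates parseable by pd.to_datetime; the column name days_since_prev is made up):

# days between consecutive purchases, per customer
txs = txs.sort_values(['customer', 'date'])
txs['days_since_prev'] = (
    pd.to_datetime(txs['date'])
      .groupby(txs['customer'])
      .diff()
      .dt.days
)

# each customer's average gap minus the global average gap:
# negative values suggest the customer purchases more often than average
gap_vs_global = txs.groupby('customer')['days_since_prev'].mean() - txs['days_since_prev'].mean()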

A trend line could also be created for a given customer, suggesting how often that customer generally purchases, and you could calculate a probability that they have churned if they don’t purchase for a given number of days.

There are many such measures that could be still extracted.

So go forth, and feature engineer!
