Feature engineering

Feature engineering is the process of transforming raw, unprocessed data into a set of targeted features that best represent your underlying machine learning problem.

Engineering thoughtful, optimized data is the vital first step.

In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

This is often one of the most valuable tasks a data scientist can do to improve model performance, for 3 big reasons:

1. You can isolate and highlight key information, which helps your algorithms “focus” on what’s important.

2. You can bring in your own domain expertise.

3. Most importantly, once you understand the “vocabulary” of feature engineering, you can bring in other people’s domain expertise!

Before moving on, we just want to note that this is not an exhaustive compendium of all feature engineering, because there are limitless possibilities for this step.

The good news is that this skill will naturally improve as you gain more experience.

Garbage in, garbage out.

I’m sure you’ve heard the phrase before.

It can apply to relationships, dieting, working out, job performance, you name it: in order to get the best results, you have to fully commit to the best practices.

Sure, it may sound simplistic, but it’s also true for machine learning projects.

The quality of your model’s predictive output will only be as good as the quality and focus of the data it receives.

The process of transforming raw, unprocessed data into a set of targeted features (or variables) that accurately represent your machine learning problem is called feature engineering.

At its most basic, the process entails answering four key questions:

1. What are the essential properties of the problem we’re trying to solve?

2. How do those properties interact with each other?

3. How will those properties interact with the inherent strengths and limitations of our model?

4. How can we augment our dataset so as to enhance the predictive performance of the AI?

Though the exact steps involved in answering these questions differ for each machine learning project, here are 5 of the best practices to ensure you’re doing all you can to optimize your data management process.

1. Utilize Domain Expertise and Individual Creativity to Determine Variables

The cornerstone of good Design Thinking also happens to be the cornerstone of good feature engineering: utilizing individual creativity and domain expertise in order to identify the important variables within your problem.

Feature Engineering is as much an art as a science.

Before even thinking about the models or algorithms or predictions, a team of domain experts and technologists must evaluate all the available variables and determine which of those variables will actually add value to your algorithm and which may result in noise or overfitting.

2. Use Indicator Variables to Isolate Important Information

Most machine learning algorithms can’t directly address categorical features, so you need to create indicator variables to represent the independent options within a category.

For example, if you’re a rideshare startup studying transportation usage in a particular region, it makes sense to have a preferred mode of transportation feature.

Within that feature, you could create indicator variables to distinguish subjects who prefer driving, biking, walking, taking the train, etc.

Indicator variables are set to numerical values so that algebraic algorithms can optimally process these features.
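As a minimal sketch of this step in pandas (the riders frame and the preferred_mode column are made up purely for illustration, not taken from any dataset in this article):

import pandas as pd

# hypothetical rideshare survey data with one categorical feature
riders = pd.DataFrame({
    'customer': [1, 2, 3, 4],
    'preferred_mode': ['drive', 'bike', 'walk', 'train']
})

# one 0/1 indicator column per category: preferred_mode_drive, preferred_mode_bike, ...
riders = pd.get_dummies(riders, columns=['preferred_mode'])
riders.head()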

3. Create Interaction Features to Highlight Variable Relationships

The next step in feature engineering is highlighting relevant interactions between two or more features.

It’s important, when looking for opportunities, to take not only the sum of variables but also the product, difference, or quotient of those variables.

For example, going back to our transportation example, if you wanted to capture the interaction between travel frequency and mode of travel, you could create interaction features to highlight each of those intersecting data points.
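As a rough sketch of what such interaction features might look like in pandas (the riders frame and its column names are hypothetical, invented for illustration):

import pandas as pd

# hypothetical rider data
riders = pd.DataFrame({
    'trips_per_week': [3, 10, 1],
    'total_fare': [21.0, 95.0, 12.5],
    'prefers_bike': [1, 0, 0]
})

# product, quotient and difference interactions between existing features
riders['bike_trip_volume'] = riders['trips_per_week'] * riders['prefers_bike']
riders['fare_per_trip'] = riders['total_fare'] / riders['trips_per_week']
riders['fare_vs_average'] = riders['total_fare'] - riders['total_fare'].mean()
riders.head()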

This step requires experimentation and an openness to new relationships and correlations.

You do not want to limit relationships based on preconceived assumptions.

Part of the fun of using machine learning to analyze your data is discovering new relationships and opportunities.

4. Combine or Remove Sparse Classes to Avoid Modeling Errors

Sparse classes are categories that have only a few data points.

These can be harmful for your machine learning algorithms, as they may cause a modeling error called overfitting.

If you combine sparse classes into one class (for example, an “other” category), or remove them completely, this will unclutter your data and improve your AI’s ability to generalize its predictions.

This ensures that your AI is not skewing your results based on a few data points that are not relevant to new data.
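A minimal sketch of that idea in pandas (the modes series and the cutoff of 5 observations are made up for illustration):

import pandas as pd

# hypothetical categorical column with a couple of sparse classes
modes = pd.Series(['drive'] * 50 + ['bike'] * 30 + ['unicycle'] * 2 + ['scooter'])

# lump any class seen fewer than 5 times into a single 'other' class
counts = modes.value_counts()
sparse = counts[counts < 5].index
modes = modes.where(~modes.isin(sparse), 'other')
modes.value_counts()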

5. Remove Irrelevant/Redundant Features

Finally, it’s useful to remove irrelevant or redundant features from your dataset.

Again, feature engineering is all about pre-processing data so your model will spend the minimum possible effort wading through the noise.

Removing irrelevant or redundant data points will help unclog the gears of your AI’s engine.
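For instance, a hedged sketch in pandas (the columns here are invented; in practice you would use domain knowledge or correlation checks to decide what is redundant):

import pandas as pd

# hypothetical frame: weight_kg and weight_lb carry the same information,
# and favourite_colour is probably irrelevant to the prediction target
df = pd.DataFrame({
    'weight_kg': [70, 80, 60],
    'weight_lb': [154.3, 176.4, 132.3],
    'favourite_colour': ['red', 'blue', 'green'],
    'target': [1, 0, 1]
})

# drop the redundant duplicate measurement and the irrelevant column
df = df.drop(['weight_lb', 'favourite_colour'], axis=1)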

In Summary

If the features of your data don’t accurately represent the predictive signals of your problem, there’s no amount of hyperparameter tuning or algorithmic tinkering that will salvage your model’s predictive ability.

Engineering thoughtful, optimized data is a vital first step to engineering thoughtful, optimized predictions.

Example in Python

Data

To illustrate what is possible, we will consider a simple transaction data set, one possibly generated from retail purchases.

Let’s say that we have a simple transaction table, with a column identifying the customer, a column indicating the product that was purchased, a column for the price, and a column containing the date and time the purchase was made.

Let us say this data is available in a CSV.

You could obtain such a data set from Kaggle’s Acquire Valued Shopper Challenge.

Look for the transactions data.

Note that in this data set, there is no price.

For this example, we will be using the Ta-Feng data set.

There are a number of other data sets for grocery/retail in Recsys.

Round 1: Basic Features

When we look at a date time stamp, a number of features, or pieces of information, are immediately obvious:

- Year
- Month
- Day
- Day of week
- Week of year
- Hour of day

Month and day of the week can be quite useful in understanding the periodicity or seasonality of transactions.

We may find that some actions are more probable on certain days of the week, or something happens around the same month every year.

With Halloween around the corner, for example, you are probably shopping for candy right now.

Using pandas, we try and load this data set (you may have to remove the header row from the file):

import pandas as pd

columns = ['date', 'customer', 'age', 'zipcode', 'product_class', 'product_id', 'amount', 'asset', 'price']
txs = pd.read_table('D11-02/D01', sep=';', header=None, names=columns)
txs.info() # to get summary statistics
txs.head() # to get a feel for the data

Unfortunately, the timestamps in this dataset are useless.

I couldn’t find a realistic data set which has time stamp information.

Welcome to the real world, with imperfect data! However, if you know of a good data set, I would love to hear from you!

For the purpose of our feature engineering, let us just imagine that timestamps are available.

Now, let us start adding our first set of features to this data set.

from datetime import datetime

year = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").year
txs['year'] = txs['date'].map(year)
txs.head()

You can see here that the feature was added to the DataFrame.

Here are some other map functions you could use:

day_of_week = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").weekday()
month = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").month
# please read the docs on how week numbers are calculated
week_number = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").strftime('%V')

You can try writing some of the other features we mentioned above yourself.
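For example, here are a couple more maps in the same style, along with how the ones above can be applied (a sketch that assumes the earlier snippets have been run and that your date strings follow the same format):

day = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").day
hour = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S").hour

txs['month'] = txs['date'].map(month)
txs['day_of_week'] = txs['date'].map(day_of_week)
txs['week_of_year'] = txs['date'].map(week_number)
txs['day'] = txs['date'].map(day)
txs['hour'] = txs['date'].map(hour)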

See, with such simple code, we just added 7 new features!

Round 2: More Interesting Features

Now, let's think of more interesting features that may involve lookups.

How about seasons, or times of day? Here are some example maps that you could run:

seasons = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0] # Dec-Feb is winter, then spring, summer, fall, etc.
season = lambda x: seasons[datetime.strptime(x, "%Y-%m-%d %H:%M:%S").month - 1]

# sleep: 12-5, 6-9: breakfast, 10-14: lunch, 14-17: dinner prep, 17-21: dinner, 21-23: desserts!
times_of_day = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5] # one entry per hour 0-23
time_of_day = lambda x: times_of_day[datetime.strptime(x, "%Y-%m-%d %H:%M:%S").hour]

We used this time of day map to understand when people look for breakfast recipes, kids' lunch box recipes, appetizers, etc.

Intuitively, you can imagine that people prepare for the next day’s lunch and breakfast around or after dinner, especially if you have children.

If you are coming to the end of your workday, you are probably thinking about dinner and what you could pick up on your way home.

This feature was extracted from clickstream data to enrich the data set and give additional insight into what types of recipes to show at what time.

The season was also a good predictor to understand which recipes are timeless and which are more seasonal.

For another grocery client, we saw a huge uptick in browsing flyers between 8 and 10 am and noon to 1 pm (lunch hour) during weekdays.

Further, Wednesdays and Thursdays were the heaviest traffic days of the week, as most people plan their grocery shopping just before the weekend.

Distance between Holidays

Retail is a very seasonal business.

Often, people are buying for an occasion or near an occasion.

Intuitively, people may be purchasing for Valentine’s day, Thanksgiving due to all the great sales, Christmas, etc.

To understand which customers are more driven by these special occurrences, a set of features needs to be created that measures the distance to each of these occurrences.

The following steps are required to make this work:

1. Pull a list of holidays/occurrences from a data source/API for a given geography.

2. Create a pandas DataFrame of these.

3. Create new columns in the transaction data frame that compute the distance between the transaction date and each holiday date.

Pull a list of holidays and create a DataFrame

There are a number of public sources, like Wikipedia, or sites that provide an API, like timeanddate.com.

For the purpose of this article, let's assume that all transactions are from the U.S.

Further, instead of using an API, let's scrape this data from here.

The code will use Beautiful Soup to extract the data and create a list of dictionary objects that can be loaded into a pandas DataFrame.

You could also create a CSV and load the CSV into a data frame.

Using this CSV approach lets you build a service that runs periodically to create this CSV and doesn’t slow down the actual processing/feature creation.

You don’t want to hit the API every time you need holiday information as it is essentially static.
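A minimal sketch of that caching pattern (the file name holidays_us.csv and the build_holidays_frame helper are hypothetical placeholders for the scraping code below):

import os
import pandas as pd

HOLIDAY_CSV = 'holidays_us.csv' # arbitrary cache file name

if os.path.exists(HOLIDAY_CSV):
    holidays_frame = pd.read_csv(HOLIDAY_CSV) # reuse the cached copy
else:
    holidays_frame = build_holidays_frame() # hypothetical function wrapping the scraper below
    holidays_frame.to_csv(HOLIDAY_CSV, index=False) # refresh the cache for next time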

For the sake of simplicity, the list of dictionaries is loaded straight into a data frame.

from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.timeanddate.com/holidays/us/2015?hol=16#!hol=49")
holidays = BeautifulSoup(page.read())
print holidays.title
# Holidays and observances in United States in 2015

At this point, a representation of the page has been loaded into memory.

The structure of the page needs to be looked at to determine the right element to target.

Fortunately, the page exposes the entire list of holidays in a table.

The column titled Holiday Type will be used to filter the values.

For the purpose of this article, rows with the following types will be used:

- National Holiday
- Observance

Now, to get to the data in the table, use Chrome to highlight the table element, right click to Inspect element, and then right click on the element in the HTML code and select Copy CSS Path to get the reference of the table:

import string # for translation

table = string.maketrans("", "") # for removing punctuation etc

rows = holidays.select("body > div > div.main-content-div > div.fixed > table > tbody > tr")
holidays_list = []
for row in rows:
    cols = list(row)
    day = cols[0].string # first col is the day
    holiday_type = cols[3].string
    name = cols[2].string
    if holiday_type is not None:
        if "national holiday" in holiday_type.lower():
            print day, name # purely to debug
            holiday = {}
            holiday['name'] = str(name).translate(table, string.punctuation + " ")
            holiday['day'] = str(day) + ", 2001" # since all transactions are from 2001
            holidays_list.append(holiday)

# now convert to a data frame
holidays_frame = pd.DataFrame(holidays_list)

To keep things simple, only national holidays were selected.

This list could be expanded to other types of holidays — this is left as an exercise for the reader.

Create distance features

The logic for creating these features is to take each row in the transaction table, compare the date of the transaction to every row in the holidays_frame created above, and compute the number of days ahead or behind that particular holiday.

new_frame = pd.DataFrame(holidays_list, index=holidays_frame['name']) # to help with the index selection

def compute_date_diff(x, y):
    # convert x into a date, y into a date, compute the date diff
    date_x = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    date_y = datetime.strptime(y, "%b %d, %Y")
    return (date_y - date_x).days

for holiday in list(new_frame.index):
    day = new_frame.loc[holiday, 'day']
    print day
    txs[holiday] = txs['date'].apply(compute_date_diff, args=(day,))

txs[['date'] + list(holidays_frame['name'])].head()

It is easy to see how weather information from public APIs could also be added in, if location information were available through customer addresses or store information.

If you were trying to build a regression model to predict the amount (dollar value) of purchases given a customer, product and date, you could train the model on 15 additional features that were just created!

Computing preferences for customers

Another interesting thing that could be done just with this data is to see which customers have a preference for a certain season, or for shopping around Valentine’s Day or Christmas.

cust_xmas = txs.groupby('customer')['ChristmasDay'].mean()
cust_xmas.order() # in newer pandas, use sort_values()

The data set used was not rich enough to have a wider variety of dates for transactions, but in a more real-world scenario, you would see how this would play out.

Feel free to try with product class, or products.

You can also combine columns like the customer-product class to see if there is a specific preference for a customer for a given product class.
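As a sketch of that combination (assuming the txs frame with the holiday distance columns created above):

# average distance-to-Christmas per (customer, product_class) pair
cust_class_xmas = txs.groupby(['customer', 'product_class'])['ChristmasDay'].mean()
cust_class_xmas.head()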

Conclusion

In this article, we converted a simple datetime column into over 15 columns! There is more information in that column that has not been teased out.

For example, days between purchases per customer could be created.

Then, this difference could be subtracted from a global average of days between purchases to determine if a customer purchases more often.
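A hedged sketch of that idea (assuming txs from above, with dates parseable by pd.to_datetime; the column name days_since_prev is made up):

# days between consecutive purchases, per customer
txs = txs.sort_values(['customer', 'date'])
txs['days_since_prev'] = (
    pd.to_datetime(txs['date'])
      .groupby(txs['customer'])
      .diff()
      .dt.days
)

# each customer's average gap minus the global average gap:
# negative values suggest the customer purchases more often than average
gap_vs_global = txs.groupby('customer')['days_since_prev'].mean() - txs['days_since_prev'].mean()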

A trend line could also be created for a given customer, suggesting how often that customer generally purchases, and you could calculate a probability that they have churned if they don’t purchase for a given number of days.

There are many such measures that could be still extracted.

So go forth, and feature engineer!
