Dealing with Categorical Data fast — an example

Samir Gadkari · Feb 7

You’re in the office at 9 AM.

Your boss comes in, gives you some data, and asks you to create a model by 12 noon.

There is a meeting in which the model will be presented.

What do you do?

We will look at an example dataset from a private Kaggle competition, create some quick models, and pick one.

The full GitHub repository is here.

We’re given the training dataset (both features and target).

We’re also given the test features dataset, and asked to predict the test target.

To test your predictions, you create a predictions file and upload it to Kaggle.

Kaggle will then give you a score (a value from 0 to 1).

The higher the value, the better your prediction.

We will focus on the accuracy score, as that is what Kaggle uses to grade this competition.
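To make “accuracy” concrete, here is a minimal sketch using sklearn’s accuracy_score (the labels below are hypothetical, not taken from the competition files):

from sklearn.metrics import accuracy_score

y_true = ['functional', 'non functional', 'functional']  # hypothetical truth
y_pred = ['functional', 'functional', 'functional']      # hypothetical predictions
accuracy_score(y_true, y_pred)  # 2 of 3 correct -> 0.667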

Import the classes you need first in your Jupyter notebook.

Keep this block separate, as you can add more libraries to it and execute it by itself.

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', None)  # Unlimited columns.
pd.options.mode.use_inf_as_na = True        # Treat any inf or -inf as NA.

Read in the training features data:

X_train_original = pd.read_csv('./train_features.csv',
                               header=[0],   # Top row is header.
                               index_col=0)  # First col is index.
X_train_original.head()

Read in the training target data:

y_train_original = pd.read_csv('./train_labels.csv',
                               header=[0],   # Top row is header.
                               index_col=0)  # First col is index.
y_train_original.head()

Your target is categorical.

Let’s see how many categories it has:

pd.value_counts(y_train_original.status_group, normalize=True)

Since more than half of the values belong to a single category, we can simply predict ‘functional’ for every target.

This will give us an accuracy of 0.54 on the training dataset.

Let’s see what it does on the testing dataset.

Majority class prediction

The reason we do a majority class prediction is to gauge how good our future prediction scores should be.

It gives us a baseline that we want our next model to beat.
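If you’d rather let sklearn build this baseline for you, a DummyClassifier does the same thing; a minimal sketch (the original notebook builds the prediction by hand instead):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
baseline.fit(X_train_original, y_train_original)
baseline.score(X_train_original, y_train_original)    # ~0.54 on the training data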

Let’s look at the test features first:

X_test_original = pd.read_csv('./test_features.csv', header=[0], index_col=0)
X_test_original.shape

(14358, 39)

This shape shows us that we need 14358 values in our prediction output (one for each row of the input).

So we create an array with the required number of rows, filled with the value ‘functional’:

y_pred = ['functional'] * len(X_test_original)
y_pred = pd.DataFrame(data=y_pred,
                      index=X_test_original.index.values,
                      columns=['status_group'])
y_pred.head()

Then we write it out to a file and upload it to Kaggle.
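The write-out step isn’t shown here, but it presumably mirrors the to_csv call used for the decision tree later in this post (the filename here is my guess):

y_pred.to_csv('./majority_class_pred.csv',
              header=['status_group'],
              index=True, index_label='id')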

Kaggle scored it at 0.53 accuracy (which is about what we expect).

The difference is just because the test dataset doesn’t contain the same exact proportion of target class values as the training dataset.

Predict with just the numerical features

X_train_numerical = X_train_original.select_dtypes(include=np.number).copy()

Convert the ‘date_recorded’ field into ‘days_since_epoch’.

In computer programming, the epoch is conventionally January 1, 1970 on Unix systems.

It’s just a widely used convention; we could have picked any reference day.

For machine learning, all that matters is that the relative spacing between the values is preserved.

days_since_epoch = (pd.to_datetime(X_train_original['date_recorded'])
                    - pd.Timestamp(1970, 1, 1))  # pd.datetime is deprecated; Timestamp works the same here.
X_train_numerical['days_since_epoch'] = days_since_epoch.dt.days
X_train_numerical.head()

X_train_numerical_indices = X_train_numerical.index.values
y_train_numerical = y_train_original[y_train_original.index.isin(X_train_numerical_indices)]

Logistic Regression

Let’s try a LogisticRegression classifier:

cv_score = cross_val_score(LogisticRegression(),
                           X_train_numerical, y_train_numerical,
                           scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
cv_score

Logistic Regression gives us a score of 0.55, not much different from the Majority Class model.

Decision tree

How about a Decision Tree classifier:

clf = DecisionTreeClassifier()
cv_score = cross_val_score(clf, X_train_numerical, y_train_numerical,
                           scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
cv_score

This score is much better at 0.65.

Let’s get the predictions for the test dataset and write them out to a file.

We can then submit it to Kaggle:

clf.fit(X_train_numerical, y_train_numerical)

X_test_numerical = X_test_original.select_dtypes(include=np.number).copy()
days_since_epoch = (pd.to_datetime(X_test_original['date_recorded'])
                    - pd.Timestamp(1970, 1, 1))
X_test_numerical['days_since_epoch'] = days_since_epoch.dt.days

y_pred = clf.predict(X_test_numerical)
y_pred = pd.DataFrame(data=y_pred,
                      index=X_test_numerical.index.values,
                      columns=['status_group'])
y_pred.to_csv('./decision_tree_pred.csv',
              header=['status_group'],
              index=True, index_label='id')

Check data for missing or unusual values

X_train_original.isnull().sum()

Seven of the 39 features have null values.

Let’s drop those features:

X_non_nulls = X_train_original.dropna(axis=1)

Let’s find out how many unique values there are in each feature:

X_non_nulls.nunique().sort_values(ascending=True)

According to this article, the Decision Tree classifier is faster when categorical values are encoded as numeric or binary.

Let’s encode the non-null columns that have < 50 unique values, add the numerical columns to that dataframe, and run a Decision Tree classifier.

X_selected = X_non_nulls.loc[:, X_non_nulls.nunique().sort_values() < 50]
cat_cols = list(X_selected.select_dtypes(['object']).columns.values)
X_categorical = X_selected[cat_cols].apply(lambda x: x.astype('category').cat.codes)
X_train_selected = X_train_numerical.join(X_categorical)

clf = DecisionTreeClassifier()
cv_score = cross_val_score(clf, X_train_selected, y_train_original,
                           scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
cv_score

This gives us a score of 0.75.

This is the training score, so we should apply the same classifier to the test data and ask Kaggle to evaluate it for accuracy:

clf.fit(X_train_selected, y_train_original)

X_test_non_nulls = X_test_original.dropna(axis=1)
X_test_selected = X_test_non_nulls.loc[:, X_test_non_nulls.nunique().sort_values() < 50]
cat_cols = list(X_test_selected.select_dtypes(['object']).columns.values)
X_test_categorical = X_test_selected[cat_cols].apply(lambda x: x.astype('category').cat.codes)
X_test_selected = X_test_numerical.join(X_test_categorical)

y_pred = clf.predict(X_test_selected)
y_pred = pd.DataFrame(data=y_pred,
                      index=X_test_selected.index.values,
                      columns=['status_group'])

The test dataset gave us a score of 0.76, which is higher because our model must have fit the test dataset a little better than the training dataset.

Still around the same value, which is to be expected.
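One caveat with this approach: calling cat.codes separately on the train and test sets can assign different integer codes to the same category string. A safer sketch reuses the categories seen during training (this guard is my addition, not part of the original notebook):

# Replacement for the X_test_categorical line above: build each test column's
# codes against the category list learned from the training data.
train_categories = {col: X_selected[col].astype('category').cat.categories
                    for col in cat_cols}
X_test_categorical = X_test_non_nulls[cat_cols].apply(
    lambda s: pd.Categorical(s, categories=train_categories[s.name]).codes)

With fixed categories, values never seen in training simply map to -1 instead of silently shifting every other code.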

Since our Decision Tree gave us a good result, let’s try the Random Forest classifier

Random Forest classifiers are good for multinomial targets (targets with multiple categorical values).

This classifier trains each tree on a random bootstrap sample of the training dataset, so there is less need for separate cross-validation.

We may do GridSearchCV to try different n_estimators and max_depth (if our score is not very good).

A Random Forest classifier consists of many decision trees.

Each tree is grown on a random sample of the data, and at each node a random subset of the features is considered for the split.

Averaging over many such trees gives the Random Forest less variance (less overfitting) than a single Decision Tree classifier.
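That bootstrap sampling also gives us a free validation estimate: each tree can be scored on the samples it did not see. A minimal sketch using sklearn’s out-of-bag score (the original notebook uses a train/test split instead):

clf = RandomForestClassifier(n_estimators=30, oob_score=True, n_jobs=-1)
clf.fit(X_train_selected, y_train_original.values.ravel())
clf.oob_score_  # accuracy on the out-of-bag samples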

X_train, X_test, y_train, y_test = train_test_split(X_train_selected,
                                                    y_train_original,
                                                    test_size=0.2)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

The Random Forest classifier gives us a score of 0.79.

Good, but not as high a jump as before.

This is what we usually find: early models usually have lower scores and are easy to beat, but later models are harder to improve on.

We’re not done yet.

We will search for the best Random Forest classifier using a grid search:

param_grid = {
    'n_estimators': [10, 20, 30],
    'max_depth': [6, 10, 20, 30]
}
gridsearch = GridSearchCV(RandomForestClassifier(n_jobs=-1),
                          param_grid=param_grid,
                          scoring='accuracy',
                          cv=3,
                          return_train_score=True,
                          verbose=10)
gridsearch.fit(X_train, y_train)

The param_grid is a dictionary of the classifier parameters we want to try.

If you’re uncertain what to put in this dictionary, this function call will give you a list of the parameters you can use:

RandomForestClassifier().get_params().keys()

Inside the GridSearchCV call, we create a RandomForestClassifier object with n_jobs = -1.

This will allow us to use all the cores on our machine, thus making this job run faster.

The variable ‘cv’ gives the number of cross-validation folds that this grid search should use.

cv = 3 will split our data into 3 equal parts, then use two of them for training the RandomForest classifier, and test with the remaining data.

It rotates the held-out part so every fold is used for testing once, and repeats this for every parameter combination in the grid.

The verbose value will tell grid search how much information to print.

A larger value prints more information.

With a value of 10, you will see each combination of variable values specified in the param_grid dictionary printed along with the iteration number of the test/train splits.

You will also see the score obtained on the test portion of the data.

You don’t have to read all of this; there is a summary we can print that is easier to read:

pd.DataFrame(gridsearch.cv_results_).sort_values(by='rank_test_score')

The top row of this dataframe shows the param_grid options that gave the best score on the test portion of the data.

This is shown in the mean_test_score column, where our score is 0.79, the same as the untuned Random Forest above.
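GridSearchCV also stores the winning combination directly, so you don’t have to scan the dataframe; these attributes are standard sklearn:

gridsearch.best_params_  # e.g. {'max_depth': 20, 'n_estimators': 30}
gridsearch.best_score_   # mean cross-validated accuracy for that combination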

Let’s retrain with the best parameters and score them on our held-out test split:

clf = RandomForestClassifier(max_depth=20, n_estimators=30, n_jobs=-1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

And we get a score of 0.81, which is not much different from the untuned Random Forest score of 0.79.

The difference is that a single Decision Tree overfits (high variance), while the Random Forest averages many trees and generalizes better.

If you test this Random Forest classifier on multiple sets of new test data, you will find that it does better than the Decision Tree classifier.
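A quick way to check that claim is to repeat the split with different seeds and compare both models each time; a minimal sketch (the loop and seeds are mine, not the original notebook’s):

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X_train_selected, y_train_original,
                                              test_size=0.2, random_state=seed)
    tree = DecisionTreeClassifier().fit(X_tr, y_tr)
    forest = RandomForestClassifier(max_depth=20, n_estimators=30,
                                    n_jobs=-1).fit(X_tr, y_tr.values.ravel())
    print(seed, tree.score(X_te, y_te), forest.score(X_te, y_te))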

Conclusion

Now that you know a Random Forest is better than a Decision Tree, maybe you can use these steps to get to a solution faster:

- Always, always, do a fast prediction first. For a classification problem like this one, if there is one majority class in the target, a majority class prediction is a good start.
- If there are few nulls (or if they’re confined to certain features), drop those observations/features.
- Drop categorical features that have a high number of unique values; they probably won’t make good features. Also drop features that have a single value, since they cannot discriminate between classes.
- Convert dates to days or seconds (for more precision). Most classifiers work with numbers, so it’s good to give them all numbers.
- Convert categorical columns to numbers.
- Since a single Decision Tree overfits, skip straight to a Grid Search with a Random Forest classifier.
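Put together, the whole recipe fits in a couple dozen lines; a minimal sketch assuming the same file layout as above (the date conversion is omitted for brevity):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X = pd.read_csv('./train_features.csv', index_col=0).dropna(axis=1)
y = pd.read_csv('./train_labels.csv', index_col=0)

# Keep numeric features plus low-cardinality categoricals encoded as integer codes.
num = X.select_dtypes(include=np.number)
cats = X.select_dtypes('object').loc[:, lambda d: d.nunique() < 50]
X = num.join(cats.apply(lambda s: s.astype('category').cat.codes))

gridsearch = GridSearchCV(RandomForestClassifier(n_jobs=-1),
                          param_grid={'n_estimators': [10, 20, 30],
                                      'max_depth': [6, 10, 20, 30]},
                          scoring='accuracy', cv=3)
gridsearch.fit(X, y.values.ravel())
gridsearch.best_score_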
