An Overview of Categorical Input Handling for Neural Networks

“Green, blue, sky, leaf”.

Categorical data can, but doesn’t have to, follow some sort of ordering.

Often, however, categorical values are simply distinct from one another, or they may even overlap.

Sometimes possible values of a category set can have relations between each other, sometimes they have nothing to do with each other whatsoever.

All of these concepts need to be kept in mind when transferring categorical input into a numeric vector space.

In the words of Wikipedia: “a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.”

Okay, enough taking credit for other people’s work.

Let’s get into it.

Domain agnostic solutions

The following ways of encoding categorical data are agnostic to the type of categories we’re interacting with.

Ordinal Encoding

Let’s start with the simplest form: assigning each possible category an integer and passing it along.

This is an enormously naive way of handling the data, and it usually serves no purpose other than making things work, meaning the program won’t crash anymore.

When looking at the country column, one may then expect each country to be replaced by an arbitrary integer. There are several downsides to this approach: Canada != 1/2 * China, and Vietnam != 40/39 * United-States.

Any higher-level information about these countries is lost in translation.
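To make the mapping concrete, here is a minimal sketch (the country values are purely illustrative) of what sklearn’s OrdinalEncoder produces for such a column:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative values only; the real column holds many more countries.
countries = np.array([["Canada"], ["China"], ["United-States"], ["Vietnam"]])
encoder = OrdinalEncoder()
print(encoder.fit_transform(countries).ravel())  # prints [0. 1. 2. 3.]: one arbitrary integer per country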

The results show worse performance when using my benchmark predictor network:

>>>> ordinal_pp = helpers.to_pipeline(("ordinalencoder", preprocessing.OrdinalEncoder()))
>>>> ordinal_pipeline = pipeline.get_pipeline(ordinal_pp, input_dim=14)
>>>> helpers.execute_on_pipeline(ordinal_pipeline, X_train, y_train, X_test, y_test)
Epoch 1/3 – 1s – loss: 0.9855 – mean_absolute_error: 0.3605 – acc: 0.7362
Epoch 2/3 – 1s – loss: 0.4939 – mean_absolute_error: 0.3108 – acc: 0.7741
Epoch 3/3 – 1s – loss: 0.4665 – mean_absolute_error: 0.2840 – acc: 0.7970
0.8492152641473572

One Hot Encoding

The one that I found most often to be the “recommended approach” is OHE, also called “Dummy Encoding”.

It’s explained on nearly every page that pops up when searching for “categorical data neural networks”.

It’s also part of sklearn and therefore very quick to apply to a dataset.

The principle is simple and best shown with a bit of code:

>>>> import helpers
>>>> from sklearn import preprocessing
>>>> import numpy as np
>>>> X_test, y_test = helpers.get_data(subset="test")
>>>> ohe = preprocessing.OneHotEncoder()
>>>> categories = np.array(list(set(X_test['workclass'].astype(str).values))).reshape(-1,1)
>>>> ohe.fit(categories)
OneHotEncoder(categorical_features=None, categories=None, dtype=<class 'numpy.float64'>, handle_unknown='error', n_values=None, sparse=True)
>>>> categories
array([['Self-emp-inc'], ['Local-gov'], ['Private'], ['State-gov'], ['Never-worked'], ['Without-pay'], ['Federal-gov'], ['Self-emp-not-inc'], ['nan']], dtype='<U16')
>>>> ohe.transform(categories).todense()
matrix([[0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1.]])

Excuse me for not having it shared as a gist.

For some reason, medium turns my pasted gist links into screenshots of the gist.

Very useless.

The results are an improvement over the previous variant:

>>> ohe_encoder_pp = helpers.to_pipeline(("ohe", preprocessing.OneHotEncoder(handle_unknown='ignore', categories=categories.get_categories())))
>>> ohe_pipeline = pipeline.get_pipeline(ohe_encoder_pp, input_dim=112)
>>> helpers.execute_on_pipeline(ohe_pipeline, X_train, y_train, X_test, y_test)
Epoch 1/3 – 2s – loss: 0.3824 – mean_absolute_error: 0.2332 – acc: 0.8358
Epoch 2/3 – 1s – loss: 0.3601 – mean_absolute_error: 0.2117 – acc: 0.8530
Epoch 3/3 – 1s – loss: 0.3547 – mean_absolute_error: 0.2125 – acc: 0.8526
0.9069985244122271

Embedding Categorical data

This paper and this post about it both describe how to turn tabular data into something a neural network can manage.

Source: https://tech.instacart.com/deep-learning-with-emojis-not-math-660ba1ad6cdc

What this does is allow you to pre-train an embedding layer for each category that you intend to convert into something a NN can consume.

In the graphic above, the Instacart team used an embedding layer to convert any of their 10 million products into a 10-dimensional embedding.

The following hidden layers then only need to handle a much smaller input size.

Also, these lower-dimensional vectors have a fixed size, which is important for building models: the input size of the first layer must be set at training time, and inputs at prediction time must adhere to the same size.

The concept is very similar to natural language processing approaches and we may even make use of the pre-trained embeddings.

More on that later.

What does this mean for our categorical values? Well, for each categorical column in the dataset, we’d have to create an embedding network that learns embeddings for that category.

If the categorical column contains words from a pre-trained embedding vocabulary (such as a colors embedding or another subdomain), it may be beneficial to adopt such a pretrained embedding.

If the vocabulary in the category is out of the ordinary, it may be better to train (or adapt) a domain-specific embedding.

This has the added benefit of much lower dimensionality.

A vector space that only needs to describe colors in relation to each other may be much smaller than one that tries to embed all words in the English language.
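As a rough sketch of the idea (not the article’s actual benchmark network; the column name, dimensions, and layer sizes are assumptions), a single categorical column could be wired into a Keras model like this:

# Assumed, illustrative setup, not the benchmark pipeline used above.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_categories = 9    # e.g. the distinct workclass values seen during training
embedding_dim = 3   # much smaller than a 9-dimensional one-hot vector

category_input = keras.Input(shape=(1,), dtype="int32", name="workclass")
x = layers.Embedding(input_dim=n_categories, output_dim=embedding_dim)(category_input)
x = layers.Flatten()(x)                     # (batch, 1, 3) -> (batch, 3)
x = layers.Dense(16, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(category_input, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Each category is still fed in as an integer index, but the network learns a
# dense vector per category instead of treating the integer as a magnitude.
X = np.random.randint(0, n_categories, size=(32, 1))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(X, y, epochs=1, verbose=0)

If a suitable pre-trained embedding already exists for the vocabulary, its weights could be loaded into the Embedding layer instead of being learned from scratch.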

Domain specific solutions

Difference Coding

Difference Coding may be helpful for a categorical value that describes ordinal properties.

Let’s assume there are just three forms of education: None, School, University.

In this example, the result would be three new variables with the following values:

None: 0, 0, 0
School: -0.5, 0.5, 0
University: -0.25, -0.25, 0.5

Basically, the higher in the order, the more expressive the values become, as more of the columns move away from 0.
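A minimal sketch of applying such a hand-built coding table with pandas (the column name and coding values simply mirror the example above):

import pandas as pd

# Hand-built coding table from the example above (hypothetical "education" column).
difference_coding = {
    "None":       [0.0,   0.0,   0.0],
    "School":     [-0.5,  0.5,   0.0],
    "University": [-0.25, -0.25, 0.5],
}

df = pd.DataFrame({"education": ["School", "None", "University"]})
coded = pd.DataFrame(
    df["education"].map(difference_coding).tolist(),
    columns=["education_d1", "education_d2", "education_d3"],
    index=df.index,
)
df = pd.concat([df.drop(columns="education"), coded], axis=1)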

String metadata

How many words are in the category string, how long is the string, which letters occur how often? While this may be extremely noisy information, a NN with a sufficiently large dataset may be able to extract some small predictive power from such information.

I remember Katie from Linear Digressions once mentioning an intuition of hers that went something along the lines of “if you can imagine a human deriving some value out of the data, a machine learning algorithm may very well do too”.

What does that mean for our dataset? Well, a long job title may be indicative of a higher rank in an (arguably too deep) hierarchy chain, thus increasing the probability of earning a higher salary.

Quick detour: This page takes it to the extreme (and distracts from my bad example).
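A minimal sketch of such string-metadata features with pandas (the occupation column and the specific features are just illustrative):

import pandas as pd

df = pd.DataFrame({"occupation": ["Exec-managerial", "Handlers-cleaners", "Prof-specialty"]})

df["occupation_length"] = df["occupation"].str.len()            # how long is the string
df["occupation_words"] = df["occupation"].str.count("-") + 1    # rough word count for hyphenated labels
df["occupation_vowels"] = df["occupation"].str.lower().str.count(r"[aeiou]")  # crude letter statistics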

Sub information

Let’s jump back to the hypothetical email column.

A domain has a TLD which may hold information about the origin of the individual.

It may also give us an indication of their affiliation with non-profit organizations (.org) or whether they work at an educational institution (.edu).

This information may be valuable for the training of a NN.

Splitting the category into sub-columns may therefore be valuable.
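A minimal sketch of such a split with pandas (the email column is hypothetical, as in the text above):

import pandas as pd

df = pd.DataFrame({"email": ["alice@example.org", "bob@some-university.edu"]})

df["email_domain"] = df["email"].str.split("@").str[1]        # "example.org"
df["email_tld"] = df["email_domain"].str.split(".").str[-1]   # "org", "edu"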

Data enrichment

While many email domains, such as those from Google Mail or Hotmail, are very widespread across the population (and thus don’t contain a lot of information), custom domains may be a strong indicator of a person’s salary, especially if those custom domains are not personal domains but belong to some organization.

There is also a plethora of data extractable from the web based on the domain name.

A crawler may be added to the pipeline that, for each encountered domain name, performs an HTTP call to the domain, extracts some keywords, and uses them during training.
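A very rough sketch of that idea (assumed, not part of the article’s pipeline; a real crawler would also need caching, rate limiting, and robots.txt handling):

import re
import requests

def domain_keywords(domain, timeout=5):
    """Fetch a domain's homepage and return the words of its <title> tag, if any."""
    try:
        response = requests.get(f"http://{domain}", timeout=timeout)
        match = re.search(r"<title>(.*?)</title>", response.text, re.IGNORECASE | re.DOTALL)
        if match:
            return match.group(1).lower().split()
    except requests.RequestException:
        pass
    return []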

Recap

Converting categories into something a neural network can process seems to be a common problem, yet finding more than a few ways to approach it seems to be hard.

This, this, and this question suggest that there really aren’t many alternatives.

Ultimately, any method that uses a NN or a regression to convert the categories into some vector representation itself requires a numeric input to begin with.

There simply aren’t many ways to map a set of alternative values other than numbering them (OrdinalEncoder) or turning each possible value into its own binary dimension (OneHotEncoder).

Even the embeddings take the detour of a OneHotEncoder before passing a fixed-size vector into the prediction layers.

If you know of any other ways of encoding such values, please share them with me! I’ll update the article accordingly and credit the author, of course.

Further Reading

Summary of many coding systems: https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2/

Autoencoders, which may also offer a way to encode the inputs: https://en.wikipedia.org/wiki/Autoencoder

Categorical variables on Wikipedia: https://en.wikipedia.org/wiki/Categorical_variable#Categorical_variables_and_regression

List of analyses of categorical data – en.wikipedia.org

Tutorial – What is a variational autoencoder? – Jaan Altosaar – jaan.io

On learning embeddings for categorical data using Keras – medium.com

How to handle different input sizes of an NN when One-Hot-Encoding a categorical input? – datascience.stackexchange.com

Why One-Hot Encode Data in Machine Learning? – machinelearningmastery.com
