All you want to know about preprocessing: Data preparation

This is an introduction part, where we are going to discuss how to check and prepare your data for further preprocessing.

Maksym Balatsko · May 29

Nowadays, almost all ML/data mining project workflows follow the standard CRISP-DM (Cross-industry standard process for data mining) or its IBM enhancement, ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics).

The longest and most important step in this workflow is data preparation/preprocessing, which takes approximately 70% of the time.

This step is important because in most situations the data provided by the customer is of poor quality or simply cannot be fed directly into an ML model.

My favorite byword concerning data preprocessing, which I'll mention in all my posts, is: garbage in, garbage out (GIGO).

In other words, if you feed your model with miserable data, don't expect it to perform well.

In this post we are going to discuss:

- Data types
- Data validation
- Handling dates
- Handling nominal and ordinal categorical values

In the next posts we are going to talk about more advanced preprocessing techniques:

- Data cleaning and standardization: normalization and standardization, handling missing data, handling outliers
- Feature selection and dataset balancing: dataset balancing, feature extraction, feature selection.

This post series follows the usual preprocessing flow order, but in fact all the parts are self-contained and do not require knowledge of the previous ones.


Data types

To start, let's define what data types exist and what measurement scales they have:

- Numeric
  - Discrete – integer values. Example: number of products bought in the shop
  - Continuous – any value in some admissible range (float, double). Example: average length of words in a text
- Categorical – the variable value is selected from a predefined number of categories
  - Ordinal – categories can be meaningfully ordered. Example: grade (A, B, C, D, E, F)
  - Nominal – categories don't have any order. Example: religion (Christian, Muslim, Hindu, etc.)
  - Dichotomous/Binary – a special case of nominal, with only 2 possible categories. Example: gender (male, female)
- Date – string, Python datetime, timestamp. Example: 12.12.2012
- Text – multidimensional data; more about text preprocessing in my previous post
- Images – multidimensional data; more about image preprocessing in my next posts
- Time series – data points indexed in time order; more about time series preprocessing in my next posts.
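To make the type inventory concrete, here is a minimal sketch (the columns and values are made up for illustration) showing how pandas reports the types it inferred:

```python
import pandas as pd

# Hypothetical dataset with one column per basic type from the list above
df = pd.DataFrame({
    'n_products': [1, 3, 2],                        # numeric, discrete
    'avg_word_len': [4.2, 5.1, 3.8],                # numeric, continuous
    'grade': ['A', 'C', 'B'],                       # categorical, ordinal
    'religion': ['Christian', 'Muslim', 'Hindu'],   # categorical, nominal
    'date': ['12.12.2012', '01.05.2013', '07.09.2014'],  # date stored as string
})

# Strings and dates both show up as generic `object` until converted explicitly
print(df.dtypes)
```

Note how the string-typed date column is indistinguishable from any other text until you parse it, which is exactly why the validation and conversion steps below matter.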

Data validation

The first step is the simplest and most obvious: you have to investigate and validate your data.

To be able to validate the data, you need a deep understanding of it.

Easy rule: don't skip the dataset's description.

The validation step consists of:

1. Data type and data representation consistency check

The same things have to be represented in the same way and in the same format. Examples:

- Dates have the same format. Several times in my practice I've received data where part of the dates was in the American format and the rest in the European one.
- Integers are really integers, not strings or floats
- Categorical data doesn't contain duplicates caused by whitespace or lower/upper case differences
- Other data representations don't contain errors

2. Data domain check

Data is in the range of permissible values. Example: numerical variables are in an admissible (min, max) range.

3. Data integrity check

Check permitted relationships and the fulfillment of constraints. Examples:

- Check name titles against sex, year of birth against age
- Historical data has the right chronology: delivery after purchase, bank account opened before the first payment, etc.
- Actions are performed by allowed entities: a mortgage can be approved only for people older than 18 years old, etc.
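The domain and integrity checks above can be sketched in pandas; the column names and permissible ranges here are hypothetical, not from any real dataset:

```python
import pandas as pd

# Toy data with one deliberately broken row per check
df = pd.DataFrame({
    'age': [25, 40, -3],
    'purchase_date': pd.to_datetime(['2019-01-10', '2019-02-01', '2019-03-05']),
    'delivery_date': pd.to_datetime(['2019-01-12', '2019-01-30', '2019-03-07']),
})

# Domain check: ages must lie in a permissible range
domain_ok = df['age'].between(0, 120)

# Integrity check: delivery must not precede purchase
integrity_ok = df['delivery_date'] >= df['purchase_date']

# Rows failing at least one check are candidates for correction or removal
invalid = df[~(domain_ok & integrity_ok)]
print(invalid)
```

Keeping each check as a named boolean mask makes it easy to report exactly which rule a bad row violated.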

OK, we have found some errors. What can we do?

1. Correct them, if you are sure what the problem is, or consult a specialist or the data provider if possible.
2. Discard the samples with errors; in many cases this is a good choice when you aren't able to apply option 1.
3. Do nothing; this, of course, can cause undesired effects in future steps.
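Option 2, discarding samples with errors, is usually a one-liner with a boolean mask (the column and range below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 40, -3, 200]})

# Keep only the rows that pass the domain check, drop the rest
df_clean = df[df['age'].between(0, 120)].reset_index(drop=True)
print(len(df_clean))  # the two out-of-range rows are discarded
```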

Handling dates

Different systems store dates in different formats: 11.12.2019, 2016-02-12, Sep 24, 2003, etc. But to build models on date data, we need to convert it to a numeric format somehow.

To start, I'll show you an example of how to convert a date string into the Python datetime type, which is much more convenient for further steps.

The example is demonstrated on a pandas dataframe.

Let's assume that the date_string column contains dates as strings:

```python
# Converts the date string column to the python datetime type
# `infer_datetime_format=True` tells the method to guess the date format from the string
df['datetime'] = pd.to_datetime(df['date_string'], infer_datetime_format=True)

# Converts the date string column to the python datetime type
# the `format` argument specifies the date format to parse; fails on errors
df['datetime'] = pd.to_datetime(df['date_string'], format='%Y.%m.%d')
```

Frequently, just the year (YYYY) is sufficient.

But if we want to keep months, days, or even more detailed data, our numeric format has to satisfy one important constraint: it has to preserve intervals. That means, for example, that Monday to Friday within one week has to have the same difference as the 1st to the 5th of any month. So the YYYYMMDD format is not an option, because the last day of a month and the first day of the next month are a bigger distance apart than the first and second days of the same month.
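A quick arithmetic check makes the problem visible: in the YYYYMMDD encoding, consecutive days across a month boundary end up much further apart than consecutive days inside a month.

```python
# YYYYMMDD as a plain integer does not preserve intervals
jan31, feb01 = 20190131, 20190201
jan01, jan02 = 20190101, 20190102

print(feb01 - jan31)  # 70, although the dates are one day apart
print(jan02 - jan01)  # 1, the same real-world distance
```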

Actually, there are 4 common methods to transform a date into a numeric format:

1. Unix timestamp

The number of seconds since 1970.

Pros:
- perfectly preserves intervals
- good if hours, minutes and seconds matter

Cons:
- values are non-obvious
- doesn't help intuition and knowledge discovery
- harder to verify, easier to make an error

Converting a datetime column to a timestamp in pandas:

```python
import numpy as np

# Converts a column of python datetime type to a unix timestamp
df['timestamp'] = df['datetime'].values.astype(np.int64) // 10 ** 9
```

2. KSP date format

Pros:
- the year and quarter are obvious
- easy intuition and knowledge discovery
- can be extended to include time

Cons:
- preserves intervals only approximately

Converting a python datetime column to the KSP format in pandas:

```python
import datetime as dt
import calendar

def to_ksp_format(datetime):
    year = datetime.year
    day_from_jan_1 = (datetime - dt.datetime(year, 1, 1)).days
    is_leap_year = int(calendar.isleap(year))
    return year + (day_from_jan_1 - 0.5) / (365 + is_leap_year)

df['ksp_date'] = df['datetime'].apply(to_ksp_format)
```

3. Divide into several features

Year, month, day, etc.

Pros:
- perfectly preserves intervals
- easy intuition and knowledge discovery

Cons:
- the more dimensions you add, the more complex your model can get (although that is not always bad).
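In pandas, this splitting can be sketched with the `.dt` accessor (the column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(['2019-05-29', '2020-01-15'])})

# Divide the date into several numeric features
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['day'] = df['datetime'].dt.day
print(df[['year', 'month', 'day']])
```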

4. Construct a new feature

Construct a new feature based on the date features. For example:

- date of birth -> age
- order creation date and order delivery date -> time to delivery

Pros:
- easy intuition and knowledge discovery

Cons:
- manual feature construction might lead to the loss of important information

Handling categorical values

Flashback: Categorical – the variable value is selected from a predefined number of categories.

Categorical values, like any other non-numeric type, also have to be converted into numeric values.

How do we do it right?

Ordinal

Categories can be meaningfully ordered, so they can be converted into numeric values that preserve their natural order. Grades: A+ – 4.0, A- – 3.7, B+ – 3.3, B – 3.0, etc.

Demonstration in pandas:

```python
grades = {
    'A+': 4.0,
    'A-': 3.7,
    'B+': 3.3,
    'B': 3.0,
}
df['grade_numeric'] = df['grade'].apply(lambda x: grades[x])
```

Dichotomous/Binary

Only one of two possible categories.

In this case, you can convert values into indicator values 1/0.

For example: Male – 1 and Female – 0, or the other way around.

Demonstration in pandas:

```python
# Note: compare with the lowercase string; comparing `x.lower()` to 'Male' would always be False
df['gender_indicator'] = df['gender'].apply(lambda x: int(x.lower() == 'male'))
```

Nominal

One of several possible categories, without any natural order.

In this case, one-hot encoding has to be used.

This method creates an indicator value for every category (1 if the sample is in the category, 0 if not).

This method is also applicable to Dichotomous/Binary categorical values.

NEVER USE AN ORDINAL REPRESENTATION FOR NOMINAL VALUES: it introduces a spurious order, and your model will not be able to handle the categorical feature in the right way.
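To see why, here is a small toy illustration (the colors are made-up example data): integer codes invent distances between categories that don't exist.

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue'])

# Ordinal encoding assigns arbitrary integer codes in order of appearance...
codes, uniques = pd.factorize(colors)
print(codes)  # [0 1 2]

# ...which falsely implies 'blue' is twice as far from 'red' as 'green' is
print(abs(int(codes[2]) - int(codes[0])) > abs(int(codes[1]) - int(codes[0])))
```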

Demonstration in pandas:

```python
# Pandas `.get_dummies()` method
df = pd.concat([df, pd.get_dummies(df['category'], prefix='category')], axis=1)
# now drop the original 'category' column (you don't need it anymore)
df.drop(['category'], axis=1, inplace=True)
```

Demonstration in sklearn and pandas:

```python
from sklearn.preprocessing import OneHotEncoder

prefix = 'category'
ohe = OneHotEncoder(sparse=False)
ohe = ohe.fit(df[['category']])
onehot_encoded = ohe.transform(df[['category']])
features_names_prefixed = [
    f"{prefix}_{category}" for category in ohe.categories_[0]
]
df = pd.concat([df, pd.DataFrame(onehot_encoded, columns=features_names_prefixed)], axis=1)
# now drop the original 'category' column (you don't need it anymore)
df.drop(['category'], axis=1, inplace=True)
```

I hope you enjoyed my post.

Feel free to ask questions in comments.

P.S. These are very basic and simple things, but they are very important in practice. Much more interesting stuff is coming in the next posts!
