Data Preprocessing while cooking dinner?

Preparing your data made as simple as chopping onions, but without the tears

Gus Dantas · May 30, 2018

I read an article on Forbes.com the other day stating that Data Preprocessing is responsible for 60% to 80% of a data scientist’s working time.

As a fresh data person, this fact surprised me, given that Machine Learning algorithms for training models can be really complex and, presumably, a great amount of time would have to be spent coding them.

Isn’t it the same when cooking? Thinking about it made me very hungry, so I went to check out Kelley’s Fried Rice recipe and… well, look at the ingredients list:

2 tablespoons butter, divided
3 eggs, whisked
2 medium carrots, small dice
1 small onion, small dice
3 cloves garlic, minced
1 cup frozen peas, thawed
etc…

Minced, divided, small diced, thawed.

Wow. Pretty much all the ingredients demand a preparation process before you can actually use them, and this is possibly where Kelley spends most of her time when cooking this dish: preparing her ingredients, or, preprocessing her raw data.

This article will show, in a simple way, how to perform four important Data Preprocessing steps in Python and R before you start building a model.

The reason why I write it is to help people who are starting their career (just like me) to use and understand these important steps:

1. Dealing with Missing Data
2. Encoding Categorical Data
3. Splitting the Training Set and Test Set
4. Feature Scaling

1. Dealing with Missing Data

(Image: the original dataset)

One of the things that comes to my mind when I see a dataset with missing data (see image) is to simply ignore the rows in question.

Well, it’s obviously not the best option, at least not for most applications, since dropping rows could severely impact the model, especially in a small dataset.

A better decision could be finding the mean (or another meaningful measure) of the values in the column and putting it in the missing value’s spot.

However, how do you do it using common Machine Learning libraries? The piece of Python code below replaces the missing values of the dataset shown above with the mean of the column, using the Imputer class from the Scikit-Learn preprocessing library.

The “strategy” parameter can be changed if the mean is not really the best choice for your dataset, though.

# Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Analogously, in R, this can be made even simpler using the vectorized ifelse function:

# Data Preprocessing

# Importing the dataset
dataset = read.csv('Data.csv')

# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
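A quick note for readers on newer scikit-learn versions: Imputer was replaced by SimpleImputer from scikit-learn 0.20 onwards. Here is a minimal sketch of the same mean imputation with the newer class, assuming the same Data.csv layout as above:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values

# Replace NaNs in the numeric columns (Age and Salary) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])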

2. Encoding Categorical Data

Looking at the original dataset again, it is easy to see that if we want to train a mathematical model using it, it has to be a numeric-only dataset.

With that in mind, how would the model see categorical data like Country (France, Spain or Germany) and Purchased (Yes, No)? It is obviously not as simple as assigning them numbers (i.e. France = 1, Spain = 2, Germany = 3): if we did so, our model would assume Germany has a higher value than France and Spain, which doesn’t make sense.

This is the reason why we need to encode categorical data: the model will look at each category as a number, but will not rank it numerically.

Basically, it consists of adding n binary columns to the dataset, where n is the number of categories, with each observation getting a 1 in the column of its own category (e.g. France becomes 1 0 0, Spain 0 1 0, Germany 0 0 1).

In our dataset, that means one column for each country. The piece of Python code below shows how to encode categorical data for our dataset using the OneHotEncoder and LabelEncoder classes from the Scikit-Learn preprocessing library.

# Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()

# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

Analogously, in R, it is much simpler:

# Data Preprocessing

# Importing the dataset
dataset = read.csv('Data.csv')

# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
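Another note for readers on newer scikit-learn versions: the categorical_features parameter was later removed from OneHotEncoder. A minimal sketch of the same encoding with the ColumnTransformer API (available from scikit-learn 0.20), assuming the same column layout as above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values

# One-hot encode column 0 (Country) and pass the other columns through;
# modern OneHotEncoder accepts string categories directly, so no LabelEncoder
# is needed for the independent variables
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)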

3. Splitting the Training Set and Test Set

Separating the training set from the test set is SO important that it might not even be understood as an optional data preprocessing step.

This separation will pretty much happen every time before training a model.

Basically, it consists of dividing the dataset’s observations into two groups: one to train the model (the Training Set), and another to test the model that was trained with the first one (the Test Set).

It is reasonable to assign 75%-80% of the observations to the Training Set and 20%-25% to the Test Set; however, always be aware of your application, the accuracy you need and the size of your dataset.

The Python piece of code below uses the train_test_split function from the Scikit-Learn cross_validation module to split the sets.

# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Analogously, in R (note that you only need to run install.packages('caTools') once):

# Data Preprocessing Template

# Importing the dataset
dataset = read.csv('Data.csv')
# dataset = dataset[, 2:3]

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
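By the way, the cross_validation module was removed in scikit-learn 0.20; on current versions the same function lives in model_selection, so the import becomes:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)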

4. Feature Scaling

Many mathematical models are based on Euclidean distances, which means that the squared distances can be overwhelmingly different from variable to variable, making the model behave strangely.

Feature Scaling is the process of putting all the values in the same “range” of magnitude, so that one variable does not end up dominating another when it shouldn’t.

For example: looking at our original dataset again, it is easy to see that Salary values are overwhelmingly greater than Age values, so to prevent the model from disregarding Age and over-weighting Salary, the Feature Scaling step should be performed.
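To see how strong this domination is, here is a tiny sketch with made-up numbers (not the article’s actual dataset): the squared salary difference dwarfs the squared age difference, so the Euclidean distance is driven almost entirely by Salary.

import numpy as np

# Two hypothetical people as [age, salary] vectors (made-up numbers)
a = np.array([30.0, 50000.0])
b = np.array([50.0, 80000.0])

squared_diffs = (a - b) ** 2
print(squared_diffs)                 # [4.e+02 9.e+08] -> age: 400, salary: 900,000,000
print(np.sqrt(squared_diffs.sum()))  # ~30000.0, essentially just the salary difference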

Unlike the previous step, Feature Scaling is not something that will always happen when preparing data.

Some models/methods have this step built in, and for some other models this difference in scale does not interfere with the result; also, it is not uncommon to find datasets where all the variables already have roughly the same scale.

The piece of Python code below uses the StandardScaler class from the Scikit-Learn preprocessing library to perform Feature Scaling (standardization: each value x becomes (x - mean) / standard_deviation, computed per column).

# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # Fit and transform
X_test = sc_X.transform(X_test)        # Only transform, reusing the training fit
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)  # Usually only needed for regression targets

Analogously, in R (note that scale() below fits each set independently; strictly, the test set should be scaled with the training set’s mean and standard deviation, as the Python code does):

# Data Preprocessing Template

# Importing the dataset
dataset = read.csv('Data.csv')
# dataset = dataset[, 2:3]

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])

Data preprocessing template!

We all love templates! So, after carefully preparing all the ingredients, we are ready to start cooking our delicious dinner, I mean, model! The Python code below is a complete Data Preprocessing Template:

# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # Fit and transform
X_test = sc_X.transform(X_test)        # Only transform
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)  # Usually only needed for regression targets

And analogously, here goes the Data Preprocessing Template in R:

csv('Data.

csv')# dataset = dataset[,2:3]#Taking care of the missing datadataset$Age = ifelse(is.

na(dataset$Age), ave(dataset$Age, FUN = function(x) mean(x, na.

rm = TRUE)), dataset$Age)dataset$Salary = ifelse(is.

na(dataset$Salary), ave(dataset$Salary, FUN = function (y) mean(y, na.

rm = TRUE)), dataset$Salary)# Encoding category datadataset$Country = factor(dataset$Country, levels = c('France','Spain','Germany'), labels = c(1,2,3))dataset$Purchased = factor(dataset$Purchased, levels = c('No','Yes'), labels = c(1,2))# Splitting the dataset into the Training set and Test set# install.

packages('caTools')library(caTools)set.

seed(123)split = sample.

split(dataset$Purchased, SplitRatio = 0.

8)training_set = subset(dataset, split == TRUE)test_set = subset(dataset, split == FALSE)# Feature Scalingtraining_set[,2:3] = scale(training_set[,2:3])test_set[,2:3] = scale(test_set[,2:3])Hey There!.My name is Gustavo Dantas, I live in Sydney and I am a newbie in this Data world.

I aim to write regular articles with my learning and findings in this field, in order to build myself stronger knowledge and to help people.

Please excuse this article if it is too straightforward, and excuse my elementary English.

Clap if you liked it, and comment with your findings and opinions.

Connect with me on LinkedIn!

Gus
