A kind of “Hello, World!” in ML (using a basic workflow)

Photo by Martim Braz on Unsplash

Antonello Calamea, CTO and certified ML enthusiast - Feb 21

Some time ago, a friend of mine told me she had to start dealing with ML topics and asked me about it, so I prepared this little example, a kind of “Hello, World!”, to show her the process of finding and predicting information from data.

IMHO, the most important thing is to define a workflow, something to follow during the analysis, because having one helps a lot.

This is mine:

1. Define objectives
2. Collect data
3. Understand and prepare the data
4. Create and evaluate the model

We'll arrive here in this post, but it's not over… you then have to:

5. Refine the model
6. Deploy

Very important: it's an iterative process, and every step can be improved, affecting the outcome of the next ones.

Let's start with the example!

1) Define objectives

What do I have to do, and what kind of problem do I have to solve? The objective is to predict the price of a house (the target), based on several variables describing the characteristics of the building (the features). As the prediction is a continuous value and both features and target values are available in the dataset, this is a supervised regression problem.

In simpler words, if someone gives me new values for the features (how big the house is, its overall quality, the number of bathrooms, etc.), I want a model that can answer with an estimated sale price.

Nothing more to add, so let's dive in…

2) Collect data

The test and train datasets are available on Kaggle.

We'll use Colaboratory from Google, a Jupyter cloud environment that already comes with a lot of libraries and offers free GPUs to run complex stuff… though we won't need them here.

Let’s start with boilerplate code to retrieve the files from GDrive, previously downloaded from Kaggle.

# code to retrieve the files from GDrive
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# List the files in the GDrive folder to find their ids
file_list = drive.ListFile({'q': "'<folder id>' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s' % (file1['title'], file1['id']))

# create local files
house_prices_train_downloaded = drive.CreateFile({'id': '<file id>'})
house_prices_train_downloaded.GetContentFile('house_prices_train.csv')

house_prices_test_downloaded = drive.CreateFile({'id': '<file id>'})
house_prices_test_downloaded.GetContentFile('house_prices_test.csv')

Let's import some libraries and take a look at the data:

# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", 100)

# No warnings about setting value on copy of slice
pd.options.mode.chained_assignment = None

# Display up to 60 columns of a dataframe
pd.set_option('display.max_columns', 60)

# Matplotlib visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Set default font size
plt.rcParams['font.size'] = 24

# Internal ipython tool for setting figure size
from IPython.core.pylabtools import figsize

# Seaborn for visualization
import seaborn as sns
sns.set(font_scale = 2)

from IPython.display import display

original_train_set = pd.read_csv('house_prices_train.csv')
display(original_train_set.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

The column descriptions are available at this link.

Other observations:

- SalePrice is the target
- There are 80 feature columns, both categorical and numerical
- There is a sufficient number of samples (1460 rows) with respect to the number of features
- 19 columns have missing values (we'll deal with this in the next step)

To help with data preparation, let's use a library called SpeedML, which allows doing several operations with fewer commands. Let's install it (with pip!) and initialize it with the train and test dataframes:

!pip install speedml

from speedml import Speedml
sml = Speedml('house_prices_train.csv', 'house_prices_test.csv', target = 'SalePrice', uid = 'Id')

Collecting speedml
  Downloading https://files.pythonhosted.org/packages/b1/72/91dcc93415b09829897b3d34a87383a946b720771b6d1662fbc017782b6c/speedml-0.9.3-py2.py3-none-any.whl
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from speedml) (0.16.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.6/dist-packages (from speedml) (0.7.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from speedml) (0.22.0)
Collecting sklearn (from speedml)
  Downloading https://files.pythonhosted.org/packages/1e/7a/dbb3be0ce9bd5c8b7e3d87328e79063f8b263b2b1bfa4774cb1147bfcd3f/sklearn-0.0.tar.gz
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from speedml) (1.14.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from speedml) (2.1.2)
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (from speedml) (0.7.post4)
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas->speedml) (2018.4)
Requirement already satisfied: python-dateutil>=2 in /usr/local/lib/python3.6/dist-packages (from pandas->speedml) (2.5.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from sklearn->speedml) (0.19.1)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->speedml) (1.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->speedml) (2.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->speedml) (0.10.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from xgboost->speedml) (0.19.1)
Building wheels for collected packages: sklearn
  Running setup.py bdist_wheel for sklearn ... done
  Stored in directory: /content/.cache/pip/wheels/76/03/bb/589d421d27431bcd2c6da284d5f2286c8e3b2ea3cf1594c074
Successfully built sklearn
Installing collected packages: sklearn, speedml
Successfully installed sklearn-0.0 speedml-0.9.3

3) Understand and prepare the data

This is a very important step, because here you set the foundation of the whole work. We can divide the process into sub-steps:

1. Basic data preparation (deal with missing values, outliers, etc.)
2. EDA (exploratory data analysis) to gather more information about the dataset (distributions, correlations, etc.) and get a better knowledge of the data
3. Feature selection: choose the most relevant features
4. Feature engineering: create new features from existing ones or other available data

3.1 Data preparation

Let's deal first with the missing values:

def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only the columns with missing values, sorted by percentage, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print a short summary (this produces the text shown in the output further below)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) + " columns that have missing values.")
    return mis_val_table_ren_columns

missing_values_table(sml.train)

In this case, we'll just drop the 4 columns with the highest number of missing values:

sml.feature.drop(['PoolQC','MiscFeature','Alley','Fence'])

'Dropped 4 features with 76 features available.'

SpeedML drops the columns in both the train and the test dataset, to avoid inconsistency.
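Conceptually, this is roughly equivalent to dropping the same columns from both dataframes with plain pandas (an illustrative sketch with hypothetical train_df and test_df frames, not SpeedML's actual implementation):

# Illustrative pandas equivalent (train_df / test_df are hypothetical names)
cols_to_drop = ['PoolQC', 'MiscFeature', 'Alley', 'Fence']
train_df = train_df.drop(columns=cols_to_drop)
test_df = test_df.drop(columns=cols_to_drop)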

Let's fill all the remaining missing values, with the median or the most frequent text value, using just a single command (impute), and let's check the results. Being an example, we can get away with this, but it can be important to choose the best strategy for every column, especially to improve results (a plain-pandas sketch of a per-column strategy follows the output below).

sml.feature.impute()
missing_values_table(sml.train)
display(sml.train.info())

'Imputed 1558 empty values to 0.'

Your selected dataframe has 76 columns.
There are 0 columns that have missing values.
Missing Values  % of Total Values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Data columns (total 76 columns):
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1460 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1460 non-null object
MasVnrArea 1460 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1460 non-null object
BsmtCond 1460 non-null object
BsmtExposure 1460 non-null object
BsmtFinType1 1460 non-null object
BsmtFinSF1 1460 non-null float64
BsmtFinType2 1460 non-null object
BsmtFinSF2 1460 non-null float64
BsmtUnfSF 1460 non-null float64
TotalBsmtSF 1460 non-null float64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1460 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null float64
BsmtHalfBath 1460 non-null float64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 1460 non-null object
GarageType 1460 non-null object
GarageYrBlt 1460 non-null float64
GarageFinish 1460 non-null object
GarageCars 1460 non-null float64
GarageArea 1460 non-null float64
GarageQual 1460 non-null object
GarageCond 1460 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(11), int64(26), object(39)
memory usage: 878.3+ KB

Nice, no more missing data.
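If you wanted the per-column strategy hinted at above, a minimal sketch with plain pandas could look like this (train_df is a hypothetical dataframe name; this is not how SpeedML's impute works internally):

# Illustrative sketch (train_df is a hypothetical dataframe):
# fill numeric columns with the median, categorical ones with the most frequent value
for col in train_df.columns:
    if train_df[col].isnull().any():
        if train_df[col].dtype == 'object':
            train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
        else:
            train_df[col] = train_df[col].fillna(train_df[col].median())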

3.2 EDA

To keep it simple, let's find the features most correlated with the target:

sml.train[sml.train.columns[0:]].corr()['SalePrice'][:-1].sort_values()

KitchenAbvGr -0.135907
EnclosedPorch -0.128578
MSSubClass -0.084284
OverallCond -0.077856
YrSold -0.028923
LowQualFinSF -0.025606
MiscVal -0.021190
BsmtHalfBath -0.016844
BsmtFinSF2 -0.011378
3SsnPorch 0.044584
MoSold 0.046432
PoolArea 0.092404
ScreenPorch 0.111447
BedroomAbvGr 0.168213
BsmtUnfSF 0.214479
BsmtFullBath 0.227122
LotArea 0.263843
HalfBath 0.284108
OpenPorchSF 0.315856
2ndFlrSF 0.319334
WoodDeckSF 0.324413
LotFrontage 0.334544
BsmtFinSF1 0.386420
Fireplaces 0.466929
GarageYrBlt 0.469056
MasVnrArea 0.472614
YearRemodAdd 0.507101
YearBuilt 0.522897
TotRmsAbvGrd 0.533723
FullBath 0.560664
1stFlrSF 0.605852
TotalBsmtSF 0.613581
GarageArea 0.623431
GarageCars 0.640409
GrLivArea 0.708624
OverallQual 0.790982
Name: SalePrice, dtype: float64

The correlation can be positive or negative (in the range [-1, 1]). The most positive features (OverallQual, GrLivArea, ...) make sense, because the price is directly proportional to their values.

Let's visualize all the correlations with a correlation matrix. Quite a puzzle! Here it's possible to see correlations not only between the target and the features, but also between features (look for example at the high correlation between GrLivArea and TotRmsAbvGrd, which makes perfect sense: more living area means more rooms can fit above grade).
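The correlation matrix appears as an image in the original post; a minimal sketch to reproduce a similar heatmap with the seaborn already imported above (using sml.train as before) could be:

# Sketch: plot the correlation matrix of the numeric columns as a heatmap
corr_matrix = sml.train.corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
plt.show()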

3.3 Feature selection

Let's focus on the most correlated features and remove outliers, using the standard definition of +/- 3 IQR. This is an operation to do with caution, because outliers can be useful data too…

columns_of_interest = ['OverallQual','GrLivArea','GarageCars','GarageArea',
                       'TotalBsmtSF','1stFlrSF','FullBath','TotRmsAbvGrd',
                       'YearBuilt','YearRemodAdd']

sml.train.loc[:,columns_of_interest].describe()

def remove_outliers(df, columns):
    for c in columns:
        print('Removing outliers from ', c)
        first_quartile = df[c].describe()['25%']
        third_quartile = df[c].describe()['75%']
        # Interquartile range
        iqr = third_quartile - first_quartile
        # Remove outliers
        df = df[(df[c] > (first_quartile - 3 * iqr)) &
                (df[c] < (third_quartile + 3 * iqr))]
    return df

sml.train = remove_outliers(sml.train, columns_of_interest)
sml.train.loc[:,columns_of_interest].describe()
sml.train.shape

Removing outliers from OverallQual
Removing outliers from GrLivArea
Removing outliers from GarageCars
Removing outliers from GarageArea
Removing outliers from TotalBsmtSF
Removing outliers from 1stFlrSF
Removing outliers from FullBath
Removing outliers from TotRmsAbvGrd
Removing outliers from YearBuilt
Removing outliers from YearRemodAdd

We can gain some insights here (the overall quality mean is around 6, the oldest house was built in 1872, and so on).

But let's visualize the data to find more info. Let's see the distribution of the target (the sale price):

_ = sns.distplot(original_train_set['SalePrice'])

Most values are under 400K. Another way to see this is to plot the ECDF, showing the cumulative distribution of the price.

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n + 1) / n
    return x, y
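The ECDF plot itself is shown as an image in the original post; a minimal way to produce it with the function above would be:

# Compute and plot the ECDF of the sale price
x, y = ecdf(original_train_set['SalePrice'])
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('SalePrice')
_ = plt.ylabel('ECDF')
plt.show()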

It's clearer here: almost all of the values are under a 400K sale price, and around 75% of them are below roughly 200K.

Let's now do some multivariate analysis between the target and the most relevant features, expecting to see a positive correlation:

_ = sns.jointplot(x="GrLivArea", y="SalePrice", data=sml.train)
sml.plot.bar("OverallQual", "SalePrice")
sml.plot.bar("GarageCars", "SalePrice")
plt.show()

Yep, the sale price definitely rises with the living area (the Pearson correlation coefficient of 0.72 tells us there is quite a strong positive correlation), with the overall quality and with the garage size, with the exception of the 4-car garage value (something that could be interesting to investigate).

As a final step, let's transform the categorical columns into something numeric, so they can be used by a ML algorithm:

# Select the object columns
object_columns = sml.train.select_dtypes('object').columns
sml.train = pd.get_dummies(sml.train, columns = object_columns)
sml.train.shape

(1449, 275)

The number of columns increased a lot after encoding the categorical features (there are now 275), but only in this way can the model be trained, because everything is numeric.
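One caveat worth noting: here the dummies are created only on the training dataframe. If you encode the test set too, the two frames can end up with different columns (a category present in one set but not in the other). A common way to keep them aligned with plain pandas is sketched below (train_df and test_df are hypothetical names, with the target removed first):

# Sketch: one-hot encode train and test, then give both the same columns
y = train_df['SalePrice']
train_encoded = pd.get_dummies(train_df.drop(columns='SalePrice'))
test_encoded = pd.get_dummies(test_df)
# Union of columns on both sides; dummies missing on one side are filled with 0
train_encoded, test_encoded = train_encoded.align(test_encoded, join='outer', axis=1, fill_value=0)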

3.4 Feature engineering

Being a “Hello, World!”, we'll use the data as is, without creating new features, but this step is very important because it can feed the model with more useful data. A quick illustration of what it could look like is sketched below.
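For example (a purely illustrative sketch; these features are not used in the rest of the post), new columns could be derived from existing ones:

# Sketch: two possible engineered features (not used later in this post)
original_train_set['TotalSF'] = (original_train_set['TotalBsmtSF']
                                 + original_train_set['1stFlrSF']
                                 + original_train_set['2ndFlrSF'])
original_train_set['HouseAge'] = original_train_set['YrSold'] - original_train_set['YearBuilt']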

4) Prepare the model

Let's split the train data into 70% (train) and 30% (test):

from sklearn.model_selection import train_test_split

features = sml.train.drop(columns='SalePrice')
targets = pd.DataFrame(sml.train['SalePrice'])

# Replace the inf and -inf with nan (required for later imputation)
features = features.replace({np.inf: np.nan, -np.inf: np.nan})

# Split into 70% training and 30% testing set
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)

Now we can start to work with a model, but first… let's define a baseline to beat with our model.

4.1 Define a baseline

Let's choose the mean absolute error (MAE) as the KPI and let's evaluate it with a naive model that always predicts the median sale price ($163,250):

# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

baseline_guess = np.median(y_test)

print('The baseline guess is %0.2f' % baseline_guess)
print("Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, baseline_guess))

The baseline guess is 163250.00
Baseline Performance on the test set: MAE = 51501.8644

So the MAE to beat is 51501.86.

4.2 Train the simplest model

In this case, let's use a linear regression with basic parameters:

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

# Train the model
lr.fit(X_train, y_train)

# Make predictions and evaluate
lr_pred = lr.predict(X_test)
lr_mae = mae(y_test, lr_pred)

print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae)

Linear Regression Performance on the test set: MAE = 17273.8701

4.3 Evaluate the model

Wow, we beat the naive baseline, cutting the MAE to roughly a third of the baseline value. Let's compare the real values with the predicted ones:

_ = plt.plot(list(y_test.iloc[:,0]), marker='o', linestyle='none', alpha=0.2, label='real values')
_ = plt.plot(lr_pred, marker='.', linestyle='none', label='predicted')
_ = plt.xlabel('number of samples')
_ = plt.ylabel('SalePrice')
plt.show()

ax = sns.distplot(lr_pred, color='red', kde=True)
ax = sns.distplot(list(y_test.iloc[:,0]), kde=True)
ax.set(xlabel='SalePrice', ylabel='probability')

The model seems to perform poorly for prices under roughly 200K, but it's a start and definitely a good result for a hello world example.

Speaking of which…

Hello, World! :)
