Predictive Modeling: Picking the Best Model

Testing out different types of models on the same data

By Kailey Smith · Feb 8

Whether you are working on predicting data in an office setting or just competing in a Kaggle competition, it’s important to test out different models to find the best fit for the data you are working with.

I recently had the opportunity to compete with some very smart colleagues in a private Kaggle competition predicting faulty water pumps in Tanzania.

I ran the following models after doing some data cleaning and I’ll show you the results.

Logistic Regression
Random Forest
Ridge Regression
K-Nearest Neighbors
XGBoost

Loading the Data

First, we need to take a look at the data we’re working with.

In this particular data set, the features were in a separate file from the labels.

import pandas as pd

pd.set_option('display.max_columns', None)

X_df = pd.read_csv('./train_features.csv')
X_df.head()

y_df = pd.read_csv('./train_labels.csv')
y_df.head()

We can see that status_group, the target label, is a string. Some models can work with string labels as they are, but others can’t.

We’ll do something about that when we get to it later.

Let’s check out our distribution of the target label.

y_df['status_group'].value_counts(normalize=True)

This split shows that we have exactly 3 classes in the label, so this is a multiclass classification problem. The majority class is ‘functional’, so if we just assigned ‘functional’ to all of the instances, our model would score about .54 on this training set. This is called the majority class baseline, and it’s our target to beat with the models we run.
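If you want to compute that baseline explicitly, here is a quick sketch; it just reproduces the .54 figure from the value counts above.

# Majority-class baseline: accuracy from always predicting 'functional'
baseline = (y_df['status_group'] == 'functional').mean()
print('Majority class baseline:', round(baseline, 2))  # about .54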

Data Cleaning and Feature Engineering

There are a lot of features in this data set, so I’m not going to go into detail on every single thing I did, but I’ll walk through the process at a high level, step by step.

First, we want to check things out by looking at all the features and data types.

X_df.info()

There are 30 object features that we’ll need to transform before we can use them in a model. The int and float columns can be used as is.

Another thing to look at is high-cardinality features. If a categorical feature has more than 100 categories, it won’t be very useful to keep: encoding it would add that many dimensions to our dataset, and we don’t want that.

Before we drop these high-cardinality columns, though, note that date_recorded is an object and would most definitely get dropped with the other high-cardinality features, so I created some new features from it first.

# So date doesn't get dropped in the next step
X_df['date_recorded'] = pd.to_datetime(X_df['date_recorded'])
X_df['YearMonth'] = X_df['date_recorded'].map(lambda x: 100*x.year + x.month)
X_df['Year'] = X_df['date_recorded'].map(lambda x: x.year)
X_df['Month'] = X_df['date_recorded'].map(lambda x: x.month)

Now that we’ve got the date sorted out, we can check for high cardinality and drop those features.

import numpy as np

max_cardinality = 100
high_cardinality = [col for col in X_df.select_dtypes(exclude=np.number)
                    if X_df[col].nunique() > max_cardinality]
X_df = X_df.drop(columns=high_cardinality)
X_df.info()

So, we dropped 8 features with high cardinality. Now we can use OneHotEncoder or pandas get_dummies() to convert the remaining object columns to ints.
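The encoding step itself isn’t shown above, so here is a minimal sketch of one of the two options mentioned, pandas get_dummies():

# One-hot encode the remaining object columns; int and float columns pass through unchanged
X_df = pd.get_dummies(X_df)
X_df.info()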

Now that all our features are numerical, let’s get into the models!

Logistic Regression

Logistic Regression is great for multiclass classification because scikit-learn encodes the target labels automatically if they are strings.

First, we need to split our data into train and test.

from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

X = X_df.drop(columns='id')  # id is our index and won't help our model
X = scale(X)
y = y_df['status_group']  # string labels are fine for LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=42, shuffle=True)

When you’re working with a learning model, it is important to scale the features to a range centered around zero.

Scaling makes sure the variances of the features are in the same range.
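As a quick sanity check, scale() standardizes each column to zero mean and unit variance, which we can verify directly:

# Each scaled feature should have mean ~0 and standard deviation ~1
print(X.mean(axis=0).round(2))
print(X.std(axis=0).round(2))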

Now, we’ll run the model on both train and test and see what our accuracy score is.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_train)
print('Train accuracy score:', accuracy_score(y_train, y_pred))
print('Test accuracy score:', accuracy_score(y_test, logreg.predict(X_test)))

We’re definitely beating our majority class baseline of .54 here, with about .73 for both train and test.

Let’s see if another model can do better.

Random Forest

Random Forest can also take strings as target labels, so we can just run the model with the same train/test split.

from sklearn.ensemble import RandomForestClassifier as RFC

rfc_b = RFC()
rfc_b.fit(X_train, y_train)
y_pred = rfc_b.predict(X_train)
print('Train accuracy score:', accuracy_score(y_train, y_pred))
print('Test accuracy score:', accuracy_score(y_test, rfc_b.predict(X_test)))

Random Forest beats Logistic Regression on both train and test, with .97 on train and .79 on test.

Ridge Regression

For Ridge Regression we’ll need to encode the target labels before running the model.

X = X_df.drop(columns=['id'])
X = scale(X)
y = y_df.drop(columns='id')
y = y.replace({'functional': 0, 'non functional': 2, 'functional needs repair': 1})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=42, shuffle=True)

Now we run the model. Ridge’s predict() method outputs continuous values rather than class labels, so we’ll convert those with numpy to get the actual predictions.

from sklearn.linear_model import Ridge
import numpy as np

ridge = Ridge()
ridge.fit(X_train, y_train)
y_prob = ridge.predict(X_train)
y_pred = np.asarray([np.argmax(line) for line in y_prob])
yp_test = ridge.predict(X_test)
test_preds = np.asarray([np.argmax(line) for line in yp_test])
print(accuracy_score(y_train, y_pred))
print(accuracy_score(y_test, test_preds))

So, Ridge Regression is not a good model for this data.

K-Nearest Neighbors

We’ll use the same train/test split as Ridge for K-Nearest Neighbors.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_train)
print('Train accuracy score:', accuracy_score(y_train, y_pred))
print('Test accuracy score:', accuracy_score(y_test, knn.predict(X_test)))

These scores look a lot better than Ridge, but still aren’t our best.

XGBoost

XGBoost is an algorithm that has been pretty popular in applied machine learning and Kaggle competitions for structured or tabular data.

It is an implementation of gradient boosted decision trees designed for speed and performance.

If you want to read more about it, check out their documentation.

I played with these parameters quite a bit when running this model and these were the best for the data I was running.

import xgboost as xgb

xg_train = xgb.DMatrix(X_train, label=y_train)
xg_test = xgb.DMatrix(X_test, label=y_test)
xg_train.save_binary('train.buffer')
xg_test.save_binary('test.buffer')

# set up parameters for xgboost
param = {}
param['objective'] = 'multi:softmax'  # use softmax multi-class classification
param['silent'] = 1  # cleans up the output
param['num_class'] = 3  # number of classes in the target label

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 30
bst = xgb.train(param, xg_train, num_round, watchlist)

The XGBoost training output reports merror, the multiclass classification error rate, which is calculated as #(wrong cases)/#(all cases); accuracy is simply 1 - merror.

# get predictions
y_pred1 = bst.predict(xg_train)
y_pred2 = bst.predict(xg_test)
print('Train accuracy score:', accuracy_score(y_train, y_pred1))
print('Test accuracy score:', accuracy_score(y_test, y_pred2))

We get .79 for train and .78 for test, which also isn’t our best score but is up there with Random Forest.

Conclusion

For my purposes, I chose to go with XGBoost and modified the parameters. My scores with the train/test split used above were .97 on train and .81 on test. My Kaggle score ended up at .795 on the given test data.

Once you’ve found the model that works best with the data you have, you can play with the parameters the model takes in and see if you can get an even better score.
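For example, a grid search with scikit-learn’s GridSearchCV is one way to do this systematically; the parameter grid below is only an illustration.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid -- adjust for your own data and time budget
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print('Best CV accuracy:', search.best_score_)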

I hope this helps in your predictive modeling endeavors!
