Time Forecast with TPOT

'.format(*[len(c) for c in counts]))
print('Constant features: ', counts[0])
print()
print('Categorical features: ', counts[2])

Figure 6

There were 12 features that contain only a single value (0); these are useless for supervised algorithms, and we will drop them later.
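The code that builds `counts` is truncated above. One plausible reconstruction is sketched below, splitting columns into constant, binary, and categorical buckets; the toy frame and its column names are stand-ins, not the actual competition data:

```python
import pandas as pd

# Toy frame standing in for the Mercedes train set (hypothetical values).
train = pd.DataFrame({
    'const': [0, 0, 0, 0],
    'bin':   [0, 1, 1, 0],
    'cat':   ['a', 'b', 'a', 'c'],
})

# counts[0]: constant features, counts[1]: binary, counts[2]: categorical.
counts = [[], [], []]
for col in train.columns:
    if train[col].dtype == 'object':
        counts[2].append(col)       # string-typed column -> categorical
    elif train[col].nunique() == 1:
        counts[0].append(col)       # single unique value -> constant
    else:
        counts[1].append(col)       # everything else here is binary

print('{} constant, {} binary, {} categorical'.format(*[len(c) for c in counts]))
```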

The rest of our data set is made up of many binary features plus 8 categorical features.

Let’s explore categorical features first.

Categorical Features

for cat in ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']:
    print("Number of levels in category '{0}': {1:2}".format(cat, train[cat].nunique()))

Figure 7

Feature X0

sort_X0 = train.groupby('X0').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X0', data=train, order=sort_X0)
plt.xlabel('X0')
plt.ylabel('Occurrences')
plt.title('Feature X0')
sns.despine();

Figure 8

X0 vs. target feature y

sort_y = train.groupby('X0')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(14, 6))
sns.boxplot(y='y', x='X0', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X0 vs. y value')
plt.show();

Figure 9

Feature X1

sort_X1 = train.groupby('X1').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X1', data=train, order=sort_X1)
plt.xlabel('X1')
plt.ylabel('Occurrences')
plt.title('Feature X1')
sns.despine();

Figure 10

X1 vs. target feature y

sort_y = train.groupby('X1')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(10, 6))
sns.boxplot(y='y', x='X1', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X1 vs. y value')
plt.show();

Figure 11

Feature X2

sort_X2 = train.groupby('X2').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X2', data=train, order=sort_X2)
plt.xlabel('X2')
plt.ylabel('Occurrences')
plt.title('Feature X2')
sns.despine();

Figure 12

X2 vs. target feature y

sort_y = train.groupby('X2')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.boxplot(y='y', x='X2', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X2 vs. y value')
plt.show();

Figure 13

Feature X3

sort_X3 = train.groupby('X3').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X3', data=train, order=sort_X3)
plt.xlabel('X3')
plt.ylabel('Occurrences')
plt.title('Feature X3')
sns.despine();

Figure 14

X3 vs. target feature y

sort_y = train.groupby('X3')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(10, 6))
sns.boxplot(y='y', x='X3', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X3 vs. y value')
plt.show();

Figure 15

Feature X4

sort_X4 = train.groupby('X4').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X4', data=train, order=sort_X4)
plt.xlabel('X4')
plt.ylabel('Occurrences')
plt.title('Feature X4')
sns.despine();

Figure 16

X4 vs. target feature y

sort_y = train.groupby('X4')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(10, 6))
sns.boxplot(y='y', x='X4', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X4 vs. y value')
plt.show();

Figure 17

Feature X5

sort_X5 = train.groupby('X5').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X5', data=train, order=sort_X5)
plt.xlabel('X5')
plt.ylabel('Occurrences')
plt.title('Feature X5')
sns.despine();

Figure 18

X5 vs. target feature y

sort_y = train.groupby('X5')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.boxplot(y='y', x='X5', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X5 vs. y value')
plt.show();

Figure 19

Feature X6

sort_X6 = train.groupby('X6').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X6', data=train, order=sort_X6)
plt.xlabel('X6')
plt.ylabel('Occurrences')
plt.title('Feature X6')
sns.despine();

Figure 20

X6 vs. target feature y

sort_y = train.groupby('X6')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.boxplot(y='y', x='X6', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X6 vs. y value')
plt.show();

Figure 21

Feature X8

sort_X8 = train.groupby('X8').size().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.countplot(x='X8', data=train, order=sort_X8)
plt.xlabel('X8')
plt.ylabel('Occurrences')
plt.title('Feature X8')
sns.despine();

Figure 22

X8 vs. target feature y

sort_y = train.groupby('X8')['y'].median().sort_values(ascending=False).index
plt.figure(figsize=(12, 6))
sns.boxplot(y='y', x='X8', data=train, order=sort_y)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels())
plt.title('X8 vs. y value')
plt.show();

Figure 23

Unfortunately, we did not learn much from the above EDA; this is life.
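The ordering trick used in each boxplot above (sorting a feature's levels by the median of y before passing them as `order=`) can be sketched in isolation; the toy data below is hypothetical:

```python
import pandas as pd

# Toy data: three levels of X0 with different median targets.
train = pd.DataFrame({'X0': ['a', 'a', 'b', 'b', 'c'],
                      'y':  [100, 110, 80, 85, 120]})

# Levels sorted by descending median of y, as passed to sns.boxplot(order=...).
sort_y = (train.groupby('X0')['y']
               .median()
               .sort_values(ascending=False)
               .index)
print(list(sort_y))  # ['c', 'a', 'b']
```

Sorting the x-axis this way makes any monotonic relationship between a categorical feature and y visible at a glance.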

However, we did notice that some categorical features have an effect on the target “y”, and “X0” seems to have the strongest effect.

After exploring, we are now going to encode the levels of these categorical features as digits using Scikit-learn’s MultiLabelBinarizer and treat them as new features.
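The encode_cat.py gist itself is not reproduced here; a minimal sketch of what such an encoding step might look like follows. The toy values are hypothetical, and each single-label value is wrapped in a list so MultiLabelBinarizer treats it as a one-element label set:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the competition data (hypothetical values).
train = pd.DataFrame({'X0': ['a', 'b', 'a', 'c'],
                      'X1': ['x', 'x', 'y', 'z']})

# One binarizer per categorical column; fit_transform returns an
# indicator matrix with one column per level of that feature.
mlb_X0 = MultiLabelBinarizer()
X0_trans = mlb_X0.fit_transform([[v] for v in train['X0']])
print(X0_trans.shape)  # (4, 3): 4 rows, 3 levels of X0
```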

encode_cat.py

We then drop the constant features and the categorical features that have been encoded, as well as our target feature “y”.

train_new = train.drop(['y', 'X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], axis=1)

We then add the encoded features to form the final data set to be used with TPOT.

train_new = np.hstack((train_new.values, X0_trans, X1_trans, X2_trans, X3_trans, X4_trans, X5_trans, X6_trans, X8_trans))

The final data set is a numpy array of shape (4209, 552).
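The hstack step simply concatenates the remaining binary matrix with each encoded block column-wise. A shape check on toy arrays (sizes are illustrative, not the real data's):

```python
import numpy as np

# Toy stand-ins: 4 rows of binary features plus two encoded categorical blocks.
binary_part = np.zeros((4, 10))
X0_trans = np.eye(4)[:, :3]   # pretend X0 has 3 levels
X1_trans = np.eye(4)[:, :2]   # pretend X1 has 2 levels

# Column-wise concatenation: widths add up, row count is unchanged.
train_new = np.hstack((binary_part, X0_trans, X1_trans))
print(train_new.shape)  # (4, 15)
```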

TPOT

It’s time to construct and fit the TPOT regressor. When it is finished, TPOT will display the hyperparameters of the “best” model (based on test-set MSE in our case), and will also output the pipeline as an execution-ready Python script file for later use or investigation.

TPOT_Mercedes_regressor.py

Figure 24

Running the above code will discover a pipeline that achieves a mean squared error (MSE) of 56 on the test set:

print("TPOT cross-validation MSE")
print(tpot.score(X_test, y_test))

You may have noticed that the MSE is a negative number. According to this thread, TPOTRegressor uses neg_mean_squared_error, which stands for the negated value of the mean squared error.
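To see that sign convention concretely: scikit-learn's 'neg_mean_squared_error' scorer (the one TPOTRegressor uses) returns exactly the negated MSE, so higher is better. The sketch below uses a plain sklearn model and synthetic data rather than the TPOT pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Tiny synthetic regression problem (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

mse = mean_squared_error(y, model.predict(X))
neg_mse = get_scorer('neg_mean_squared_error')(model, X, y)

# The scorer's value is the MSE with its sign flipped.
print(np.isclose(neg_mse, -mse))  # True
```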

Let’s try it again.

from sklearn.metrics import mean_squared_error

print('MSE:')
print(mean_squared_error(y_test, tpot.predict(X_test)))
print('RMSE:')
print(np.sqrt(mean_squared_error(y_test, tpot.predict(X_test))))

So, the difference between our predicted time and the real time is about 7.5 seconds. Not a bad result at all. And the model that produces this result is one that fits a RandomForestRegressor stacked with a KNeighborsRegressor on the data set.
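TPOT's exported script expresses this stacking with its own StackingEstimator helper; a comparable sketch using scikit-learn's StackingRegressor is shown below. The data is synthetic and the exact structure is illustrative, not the pipeline TPOT actually found:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (illustrative only).
rng = np.random.RandomState(1)
X = rng.rand(80, 4)
y = X.sum(axis=1) + rng.normal(scale=0.05, size=80)

# KNN as a base learner whose predictions feed a random forest final
# stage, mirroring the shape of a "stacked" TPOT pipeline.
stack = StackingRegressor(
    estimators=[('knn', KNeighborsRegressor(n_neighbors=5))],
    final_estimator=RandomForestRegressor(n_estimators=50, random_state=1),
)
stack.fit(X, y)
print(stack.predict(X[:2]).shape)  # (2,)
```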

Finally, we are going to export this pipeline:

tpot.export('tpot_Mercedes_testing_time_pipeline.py')

tpot_Mercedes_testing_time_pipeline.py

I enjoyed learning and using TPOT; I hope you did too. The Jupyter notebook can be found on Github. Have a great weekend!

Reference: TPOT Tutorial.