Applying the Universal Machine Learning Workflow to the UCI Mushroom Dataset

Lepiota Mushrooms – Image Credit: East Tennessee Wildflowers

This post is intended to demonstrate the universal machine learning workflow as stated by Francois Chollet in Deep Learning with Python.

Matt Kirby · May 17

We will be using the Mushroom Dataset from UCI’s Machine Learning Repository to perform our demonstration.

This work is meant for a reader who has at least a basic understanding of Python fundamentals and some experience with machine learning.

That being said, I will provide copious links to supporting sources for the uninitiated so that anyone can make use of the information presented.

Before we get started, I’d like to give thanks to my fellow Lambdonian Ned H for all his help on this post.

The Universal Machine Learning Workflow

1. Define the problem and assemble a dataset
2. Choose a measure of success
3. Decide on an evaluation protocol
4. Prepare the data
5. Develop a model that does better than a baseline
6. Develop a model that overfits
7. Regularize the model and tune its hyperparameters

1. Define the problem and assemble a dataset

Stated concisely, our problem is the binary classification of a mushroom as edible or poisonous.

We are given a dataset with 23 columns, including the class (edible or poisonous) of each mushroom.

From the features listed in the data information file we can create a list of column names for our dataset.

```python
column_names = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?',
                'odor', 'gill-attachment', 'gill-spacing', 'gill-size',
                'gill-color', 'stalk-shape', 'stalk-root',
                'stalk-surface-above-ring', 'stalk-surface-below-ring',
                'stalk-color-above-ring', 'stalk-color-below-ring',
                'veil-type', 'veil-color', 'ring-number', 'ring-type',
                'spore-print-color', 'population', 'habitat']
```

Let’s import our dataset and create a Pandas DataFrame from the .data file using pd.read_csv().

```python
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushrooms = pd.read_csv(url, header=None, names=column_names)
```

2. Choose a measure of success

Given the nature of our problem, classifying whether a mushroom is edible or poisonous, we will use precision as our measure of success.

Precision is the ability of the classifier not to label a poisonous mushroom as edible; formally, it is the ratio of true positives to all predicted positives, TP / (TP + FP).

We would much rather people discard edible mushrooms that our model classified as poisonous than eat poisonous mushrooms our classifier labeled as edible.
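To make this concrete, here is a minimal sketch with hypothetical labels (1 = edible as the positive class) showing how a single poisonous mushroom mislabeled as edible drags precision down:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 1]  # actual classes: 0 = poisonous, 1 = edible
y_pred = [0, 1, 1, 1, 1]  # one poisonous mushroom predicted edible

# 3 of the 4 mushrooms predicted edible actually are, so precision = 0.75
print(precision_score(y_true, y_pred))
```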

```python
from sklearn.metrics import precision_score
```

3. Decide on an evaluation protocol

We will be using 10-fold cross-validation to evaluate our model.

While a simple hold-out validation set would probably suffice, I am skeptical of its reliability given that we only have ~8,000 samples.
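As a quick illustration of the protocol, 10-fold cross-validation fits the model ten times, each time scoring on a different held-out tenth of the data. Here is a minimal sketch on toy data (make_classification and LogisticRegression are just stand-ins, since we have not built our mushroom feature matrix yet):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for our mushroom features and labels
X_demo, y_demo = make_classification(n_samples=200, random_state=42)

# Ten fits, each scored on a different held-out tenth of the data
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         scoring='precision', cv=10)
print(scores.mean(), scores.std())
```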

```python
from sklearn.model_selection import train_test_split, cross_validate
```

First, let’s split our data into a feature matrix (X) and a target vector (y).

We will use OneHotEncoder to encode our categorical variables.
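One-hot encoding turns each category into its own 0/1 column. Here is a minimal sketch using pandas’ built-in get_dummies on a toy column (the encoding below uses the category_encoders library instead, whose use_cat_names=True option keeps the original category names):

```python
import pandas as pd

toy = pd.DataFrame({'odor': ['almond', 'foul', 'none', 'foul']})
print(pd.get_dummies(toy).astype(int))
#    odor_almond  odor_foul  odor_none
# 0            1          0          0
# 1            0          1          0
# 2            0          0          1
# 3            0          1          0
```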

```python
import category_encoders as ce

# Drop target feature
X = mushrooms.drop(columns='class')

# Encode categorical features
X = ce.OneHotEncoder(use_cat_names=True).fit_transform(X)

y = mushrooms['class'].replace({'p': 0, 'e': 1})

print('Feature matrix size:', X.shape)
print('Target vector size:', len(y))
```

```
Feature matrix size: (8124, 117)
Target vector size: 8124
```

Next we will split our data into a training set and a test set.

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=.2, stratify=y)

print('Training feature matrix size:', X_train.shape)
print('Training target vector size:', y_train.shape)
print('Test feature matrix size:', X_test.shape)
print('Test target vector size:', y_test.shape)
```

```
Training feature matrix size: (6499, 117)
Training target vector size: (6499,)
Test feature matrix size: (1625, 117)
Test target vector size: (1625,)
```

4. Prepare the data

We’re almost ready to begin training models, but first we should explore our data, familiarize ourselves with its characteristics, and format it so that it can be fed into our model.

We could use .dtypes, .columns, and .shape to examine our dataset, but Pandas provides an .info() method that lets us view all of this information in one place.

```python
print(mushrooms.info())
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
class                       8124 non-null object
cap-shape                   8124 non-null object
cap-surface                 8124 non-null object
cap-color                   8124 non-null object
bruises?                    8124 non-null object
odor                        8124 non-null object
gill-attachment             8124 non-null object
gill-spacing                8124 non-null object
gill-size                   8124 non-null object
gill-color                  8124 non-null object
stalk-shape                 8124 non-null object
stalk-root                  8124 non-null object
stalk-surface-above-ring    8124 non-null object
stalk-surface-below-ring    8124 non-null object
stalk-color-above-ring      8124 non-null object
stalk-color-below-ring      8124 non-null object
veil-type                   8124 non-null object
veil-color                  8124 non-null object
ring-number                 8124 non-null object
ring-type                   8124 non-null object
spore-print-color           8124 non-null object
population                  8124 non-null object
habitat                     8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
None
```

Another useful step is to check the number of null values and where they are in the DataFrame.

```python
print(mushrooms.isna().sum())
```

```
class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises?                    0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
```

None… that seems a bit too good to be true.

Since we were studious and read the dataset information file, we’re aware that all missing values are marked with a question mark. Knowing this, we can use df.replace() to convert the ? characters to NaNs.

```python
import numpy as np

mushrooms = mushrooms.replace({'?': np.NaN})
print(mushrooms.isna().sum())
```

```
class                          0
cap-shape                      0
cap-surface                    0
cap-color                      0
bruises?                       0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64
```

There we are: stalk-root has 2480 missing values. Let’s replace these with m for missing. (Note that because we built and encoded our feature matrix X before this cleanup, the ? values there were simply encoded as their own one-hot category; in a real pipeline you would want to handle missing values before encoding.)

```python
mushrooms['stalk-root'] = mushrooms['stalk-root'].replace(np.NaN, 'm')
print(mushrooms['stalk-root'].value_counts())
```

```
b    3776
m    2480
e    1120
c     556
r     192
Name: stalk-root, dtype: int64
```

5. Develop a model that does better than a baseline

Baseline Model

Using the most common label from our dataset, we will create a baseline model that we hope to beat.

First, let’s look at how class is distributed using value_counts().

```python
mushrooms['class'].value_counts(normalize=True)
```

```
e    0.517971
p    0.482029
Name: class, dtype: float64
```

We will use the mode of the class attribute to create our baseline prediction.

```python
majority_class = y_train.mode()[0]
baseline_predictions = [majority_class] * len(y_train)
```

Let’s see how accurate our baseline model is.

```python
from sklearn.metrics import accuracy_score

majority_class_accuracy = accuracy_score(y_train, baseline_predictions)
print(majority_class_accuracy)
```

```
0.5179258347438067
```

~52%… which is what we would expect given the distribution of class in our initial dataset.

Decision Tree

We will attempt to fit a decision tree to our training data and produce an accuracy score greater than 52%.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

tree = DecisionTreeClassifier(max_depth=1)

# Fit the model
tree.fit(X_train, y_train)

# Visualize the tree
dot_data = export_graphviz(tree, out_file=None,
                           feature_names=X_train.columns,
                           class_names=['Poisonous', 'Edible'],
                           filled=True, impurity=False, proportion=True)
graphviz.Source(dot_data)
```

Now that we have fitted the decision tree to our data, we can analyze our model by looking at the prediction probability distribution for our classifier.

In simple terms, prediction probability represents how sure the model is about its classification label.
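For instance, predict_proba returns one probability per class for each sample; here is a quick peek at our fitted tree (the three-row slice is just for illustration):

```python
# Each row is [P(poisonous), P(edible)] for one training sample
print(tree.predict_proba(X_train.iloc[:3]))
```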

In addition to prediction probability, we will look at the precision score of our decision tree.

Sklearn provides us with a simple way to see many of the relevant scores for classification models with classification_report.

We will also generate a confusion matrix using sklearn’s confusion_matrix.

A confusion matrix shows the number of true and false positives and negatives.
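As a minimal illustration with hypothetical labels (1 = edible), confusion_matrix puts actual classes on the rows and predicted classes on the columns:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]  # 0 = poisonous, 1 = edible
y_pred = [0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[1 1]    row 0, actual poisonous: 1 predicted poisonous, 1 predicted edible
#  [1 2]]   row 1, actual edible:    1 predicted poisonous, 2 predicted edible
```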

Since we will be using these tools again, we will write a function to run our model analysis for us.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix


def model_analysis(model, train_X, train_y):
    # Plot the distribution of prediction probabilities
    model_probabilities = model.predict_proba(train_X)
    model_prediction_probability = [max(probs) for probs in model_probabilities]

    plt.figure(figsize=(15, 10))
    sns.distplot(model_prediction_probability)
    plt.title('Model Prediction Probabilities')

    # Set x and y ticks
    plt.xticks(color='gray')
    plt.yticks(color='gray')

    # Create axes object with plt.gca (get current axes)
    ax = plt.gca()

    # Set grid lines
    ax.grid(b=True, which='major', axis='y', color='black', alpha=.2)

    # Set facecolor
    ax.set_facecolor('white')

    # Remove box
    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.tick_params(color='white')
    plt.show()

    model_predictions = model.predict(train_X)

    # Classification report
    print(classification_report(train_y, model_predictions,
                                target_names=['0-Poisonous', '1-Edible']))

    # Confusion matrix
    con_matrix = pd.DataFrame(confusion_matrix(train_y, model_predictions),
                              columns=['Predicted Poison', 'Predicted Edible'],
                              index=['Actual Poison', 'Actual Edible'])
    plt.figure(figsize=(15, 10))
    sns.heatmap(data=con_matrix, cmap='cool')
    plt.title('Model Confusion Matrix')
    plt.show()

    return con_matrix
```

Now to apply this function to our decision tree.

```python
model_analysis(tree, X_train, y_train)
```

We will store our predictions in a tree_predictions variable for use in interpreting our model’s accuracy.

```python
tree_predictions = tree.predict(X_train)
accuracy_score(y_train, tree_predictions)
```

```
0.8862901984920757
```

88% accuracy isn’t bad.
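Since precision is our chosen measure of success, it is worth checking it here as well (a quick sketch using the precision_score we imported earlier; output omitted, since the exact value depends on the fitted tree):

```python
# Of all the mushrooms the tree labels edible, how many actually are?
print(precision_score(y_train, tree_predictions))
```

With that, let’s move on to the next step in our workflow.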

6. Develop a model that overfits

We will use the RandomForestClassifier for our overfitting model.

```python
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100, max_depth=5)

cv = cross_validate(estimator=random_forest,
                    X=X_train,
                    y=y_train,
                    scoring='accuracy',
                    n_jobs=-1,
                    cv=10,
                    verbose=10,
                    return_train_score=True)
```

```
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    4.7s finished
```

Now we can see our random forest’s accuracy score.

```python
random_forest.fit(X_train, y_train)

train_predictions = random_forest.predict(X_train)
accuracy_score(y_train, train_predictions)
```

Roughly 99% accuracy on the training data; that looks overfitted to me.

We can use our model_analysis function from earlier to analyze our model.

```python
model_analysis(random_forest, X_train, y_train)
```

7. Regularize the model and tune its hyperparameters

Now we will tune the hyperparameters of our RandomForestClassifier and attempt to walk the line between underfitting and overfitting.

We can use sklearn’s RandomizedSearchCV to search the hyperparameters in our param_distributions dictionary.

```python
from sklearn.model_selection import RandomizedSearchCV

# Note: this grid has only 30 distinct combinations,
# so at most 30 unique settings can be sampled
param_distributions = {
    'max_depth': [1, 2, 3, 4, 5],
    'n_estimators': [10, 25, 50, 100, 150, 200]
}

search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                            param_distributions=param_distributions,
                            n_iter=100,
                            scoring='precision',
                            n_jobs=-1,
                            cv=10,
                            verbose=10,
                            return_train_score=True)

search.fit(X_train, y_train)
```

We can use search.best_estimator_ to see which model has the highest precision score.

```python
best_model = search.best_estimator_
best_model
```

```
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
```

From the model description we can see that a RandomForestClassifier with a max_depth of 5 and 10 estimators is our optimal model.
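If you just want the winning settings and their cross-validated score, RandomizedSearchCV also exposes them directly:

```python
print(search.best_params_)  # the sampled settings with the best CV precision
print(search.best_score_)   # mean cross-validated precision of best_model
```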

Now we can run our analysis function.

```python
model_analysis(best_model, X_test, y_test)
```

3 false positives; not perfect, but pretty good.

Conclusion

To restate our workflow:

1. Define the problem and assemble a dataset
2. Choose a measure of success
3. Decide on an evaluation protocol
4. Prepare the data
5. Develop a model that does better than a baseline
6. Develop a model that overfits
7. Regularize the model and tune its hyperparameters

While Chollet describes this as THE universal machine learning workflow, there are infinite variations depending on the specific problem we are trying to solve.

In general, though, you will always start by defining your problem and collecting data (whether from a premade dataset or your own data collection).

I hope this post has presented an informative walkthrough of Chollet’s universal machine learning workflow.

Thanks for reading! Follow me on Twitter, GitHub, and LinkedIn.

P.S. Here is the link to the Colab Notebook I used for this post.
