Cracking the handwritten digit recognition problem with Scikit-learn

In ML terms: we have a dataset of scanned images (handwritten digits, in this case), and for each image we know the digit it represents.

For instance, there’s an image and we know that it represents number 5.

The problem statement is: learn all the digits from the dataset and predict unseen digits (new digits that are not part of the dataset) as accurately as possible.

A handwritten digit dataset (a.k.a. the MNIST dataset)

In this blog, you are going to solve this problem from scratch: installing all the tools you need, understanding the code and using a very well-known library called Scikit-Learn.

Let’s get started!

Photo by Damian Zaleski on Unsplash

Tools needed

We will have to install some tools in order to code and run the example.

You may skip this section if you already have an environment to code in Python.

PyCharm

It’s quite common to use Jupyter Notebooks when doing experiments around Python or ML.

However, when building microservices for ML, it’s way better to use an IDE, as you will probably need to code in a more robust environment.

The best Python IDE I know is PyCharm.

Installing PyCharm is pretty easy; there’s an installer for most operating systems.

Note: As of now, there are two editions of PyCharm: Community (free) and Professional (paid).

The Community edition works pretty well for the examples we’ll see, but if your application gets really big, you may consider buying the Professional license.

Conda

Most programming languages have package managers.

They’re useful to install dependencies and automate tasks, among other things.

A classic package manager for Python is Pip.

However, Conda, another package manager, works a bit better since it can also install dependencies that are “external” to Python (for instance, a dataset or a configuration file).

Another advantage that Conda has is that it creates virtual environments, which allows you to switch contexts really fast (for example, a different version of a dependency or even a different Python version).
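As a sketch, the whole setup can be done from the terminal. The environment name sklearn-demo and the Python version below are arbitrary choices for illustration, not something prescribed by this blog:

```shell
# Create an isolated environment with its own Python interpreter
conda create --name sklearn-demo python=3.10

# Switch into it; packages installed from here on stay inside the environment
conda activate sklearn-demo

# Install the libraries used later in this blog
conda install scikit-learn matplotlib
```

Switching to a different Python version is then just a matter of creating a second environment and activating it.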

To install Conda you can follow these steps.

To configure Pycharm to use Conda, you can follow this post.

Introduction to Scikit-Learn (a.k.a. Sklearn)

Sklearn is an ML library for Python that provides built-in solutions for data analysis in general.

Among these solutions, there are components for Classification, Regression and Clustering problems (in this article we’ll focus on classification only, regression and clustering will be covered in future blogs).

It also provides tools for data cleaning and preprocessing, which are normally used to remove unwanted fields or duplicates, or to change formats (e.g. date or number formats).

Data cleaning is out of scope but it will be covered in future blogs as well.

Sklearn has a super clear API, which is documented here.

Basically, most of the components share the same methods:

fit(): used to estimate some parameters based on a dataset. These parameters will later allow us to make predictions on unseen data.

transform(): used to transform a dataset (e.g. word tokenization).

predict(): used to make predictions on unknown samples.

Note: Some estimators also have a method called fit_transform() that runs fit() and then transform() in a single call. Normally, there are some computational optimizations done under the hood, which make fit_transform() faster.
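To see the contract concretely, here is a small sketch (it uses StandardScaler, a preprocessing component not covered in this blog, purely as an example) showing that fit_transform() yields the same result as calling fit() and then transform() separately:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Two-step version: estimate mean/std with fit(), then scale with transform()
scaler_a = StandardScaler()
X_two_step = scaler_a.fit(X).transform(X)

# One-step version: fit_transform() does the same work in a single call
scaler_b = StandardScaler()
X_one_step = scaler_b.fit_transform(X)

print(np.allclose(X_two_step, X_one_step))  # True
```

Same result either way; the one-step form just gives the estimator a chance to avoid redundant passes over the data.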

The design of this API is great because one may change components or models and, syntactically, the rest of the code will still be correct:

model_1.fit()

If we decided to use another model, e.g. model_2, instead of model_1, all we would have to do is:

model_2.fit()

Of course, the behavior may be different, but the syntax will still be valid.
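As a sketch of this interchangeability (the decision tree below is just an arbitrary second model, not one this blog otherwise uses), the same fit()/predict() calls work unchanged for different classifiers:

```python
from sklearn import datasets, svm
from sklearn.tree import DecisionTreeClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Both models expose the same API, so the loop body never changes
for model in (svm.SVC(gamma=0.001, C=100), DecisionTreeClassifier()):
    model.fit(X[:-10], y[:-10])                           # train on all but the last 10 digits
    print(type(model).__name__, model.predict(X[-10:]))   # predict the held-out 10
```

Swapping models really is a one-line change; only the quality of the predictions differs.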

Sklearn Hello World!

The example we’ll run is pretty simple: learn to recognize digits.

Given a dataset of digits, learn the shape of them and predict unseen digits.

This example is based on the Sklearn basic tutorial.

Verify your Python configuration

Before we move forward, just run a simple Python file to make sure you have configured everything properly.

1. Open PyCharm
2. Create a new project
3. Create a Python file
4. Add the following line into it: print("Running Sklearn Hello World!")
5. Run the file. You should see that string in the console.

Import datasets

Sklearn has some built-in datasets that allow you to get started quickly.

You could download the dataset from somewhere else if you want to, but in this blog, we’ll use Sklearn’s datasets.

Note: How digits are transformed from images into pixels is out of the scope of this blog.

Assume that someone did a transformation to get pixels from scanned images, and that’s your dataset.

1. Edit your Python file and, before the print command, add the following import:

from sklearn import datasets

2. Explore the dataset:

digits = datasets.load_digits()
print(digits.data)

3. Run your Python file.

You should see the following output in the console:

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]

What you’re seeing in that output are all the digits (or instances) and all the features that each instance has.

In this example, the pixels of each digit.

If we printed digits.target instead, we would see the real values (classifications) for those digits: array([0, 1, 2, ..., 8, 9, 8]).

Features are attributes about an instance.

A person may have attributes like nationality, skills, etc.

Instead of calling them attributes, they’re called features.

In our case, our instances (digits) have the brightness level of each pixel as attributes or features.
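You can verify this yourself: each digit in Sklearn’s dataset is an 8x8 image, so every instance has 64 brightness features. This quick check is an addition to the original walkthrough:

```python
from sklearn import datasets

digits = datasets.load_digits()

# 1797 instances, each flattened into 64 pixel-brightness features
print(digits.data.shape)    # (1797, 64)

# The same instances kept as 8x8 images, handy for rendering later
print(digits.images.shape)  # (1797, 8, 8)

# One target digit (0-9) per instance
print(digits.target.shape)  # (1797,)
```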

Learn from our dataset

ML is about generalizing the behavior of our dataset.

It’s like taking a look at the data and saying something like “yes, it seems that next month we’ll increase our sales”.

That’s because based on what happened, you’re trying to generalize the situation and predict what may happen in the future.

There are basically two ways of generalizing from data:

1. Learning by heart: this means “memorizing” all the instances and then trying to match new instances to the ones we know.

A good example of this is explained in [1]: If we had to implement a spam filter, one way could be flagging all emails that are identical to emails already flagged as spam.

The similarity between emails could be the number of words they have in common with a known spam email.

2. Building a model to represent data: this implies building a model that approximates the known data, so it can predict values for unseen instances.

The general idea is that if we know that instances A and B are similar and A has a target value 1, then we can guess that B may have a target value 1 as well.

The difference with the first approach is that by building a model, we’re adjusting it to represent the data and then we forget about the instances.
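As a sketch of both approaches on the digits dataset: k-nearest neighbors is a classic “learning by heart” (instance-based) method, while the SVM used later in this blog is model-based. The 100-digit hold-out and the choice of 3 neighbors are illustrative assumptions, not values from the original post:

```python
from sklearn import datasets, svm
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, y_train = digits.data[:-100], digits.target[:-100]
X_test, y_test = digits.data[-100:], digits.target[-100:]

# Instance-based: k-NN stores the training instances and classifies
# each new digit by looking at its 3 most similar stored neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Model-based: the SVM adjusts its parameters to represent the data
svc = svm.SVC(gamma=0.001, C=100).fit(X_train, y_train)

print("k-NN accuracy:", knn.score(X_test, y_test))
print("SVM accuracy:", svc.score(X_test, y_test))
```

Both generalize well here; the practical difference shows up in memory use and prediction speed, since k-NN must keep (and search) the whole training set.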

A cats-vs-dogs classifier. In our case, we’ll classify by digit: 0, 1, 2, etc.

Let’s create a model that represents our data behavior.

As this is a classification problem (given some instances, we want to classify them based on their features and predict the digit they represent), we will call our component classifier and we’ll choose a Support Vector Machine (SVM).

There are many other classifiers in Sklearn, but this one will be enough for our use case.

For further details on when to use certain components depending on the problem, you can follow this cheat-sheet:

ML map. Depending on the problem, you may choose different components.

Import the classifier class:

from sklearn import svm

Now create a new instance of the classifier:

clf = svm.SVC(gamma=0.001, C=100)

You must have noticed there are two parameters in the constructor, called gamma and C.

In machine learning, these parameters are called hyperparameters and they’re the ones in charge of parameterizing our learning or training process.

They’re useful to adjust our model to our dataset.

Although in this example they’re hardcoded, they’re usually adjusted dynamically by training several times and testing the output model with a testing or validation dataset.

One way to discover the optimal values of the hyperparameters is to combine grid search with k-fold cross-validation.
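As a sketch of that idea (the parameter ranges below are illustrative guesses, not tuned values from this blog), Sklearn’s GridSearchCV trains every combination and cross-validates each one:

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Illustrative search space for the two SVC hyperparameters
param_grid = {"gamma": [0.0001, 0.001, 0.01], "C": [1, 10, 100]}

# Each gamma/C pair is trained and scored with 5-fold cross-validation
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(digits.data, digits.target)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

This is how hardcoded values like gamma=0.001 would be found in practice, instead of guessed.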

Sometimes we need to preprocess our data (i.e. clean it up).

In this case, the dataset is ready to use so we’re skipping this step.

Once our classifier is created, we can use it to build or fit a new model:

clf.fit(digits.data[:-1], digits.target[:-1])

As this learning is supervised (we’re telling the model what the target values are, i.e. the digit), we need to pass both the features and the target values to the classifier.

Now we can use our classifier to predict instances:

print("Predicting")
digit_to_test = 1
some_digit = digits.data[digit_to_test]
digit_target = digits.target[digit_to_test]
prediction = clf.predict([some_digit])
print(prediction)
print("Real Value ", digit_target)

In the console you will see the following output:

>>> Predicting
>>> [1]
>>> Real Value  1

This means that the classifier correctly guessed the target value for that instance.
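A single correct guess says little on its own. As an optional sanity check that goes beyond the original walkthrough (the test_size and random_state values below are arbitrary choices), you can hold out a chunk of digits and measure accuracy on all of them:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Keep 25% of the digits completely unseen during training
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100).fit(X_train, y_train)

# score() returns the fraction of held-out digits classified correctly
print("Test accuracy:", round(clf.score(X_test, y_test), 3))
```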

That was easy… but too magical! Let’s parse the pixels to render the digit, so we can validate that it’s actually a 1 [1]:

Add the following imports:

import matplotlib
from matplotlib import pyplot as plt

And then the following code after predicting:

some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

Run the code and you will see the rendered digit. Do you agree it’s a 1 as well? :)

[1]: “Hands-on Machine Learning with Scikit-Learn & TensorFlow”, Aurélien Géron, 2017.

Conclusion

We’ve seen how to address an ML problem from scratch using Sklearn. And it only took a few lines of code!

On GitHub, you’ll find the full example.

If you liked this post, please share it with your friends and colleagues!

Thanks to Martin Rey for collaborating on this blog and to Lautaro Petaccio for the feedback.
