Python For Data Science: From Scratch (Part III)

See, the model doesn’t know in advance whether the recommendation will be a dishwasher or an air conditioner; it simply learns from the data and provides the result.

Scikit-learn: Scikit-learn is an open-source library that houses several methods that let us perform machine learning on the given data.

It depends on the SciPy and NumPy libraries, so make sure you have all the guns loaded before we start using it.

Though sklearn comes installed when you install Anaconda, if you still face any issues, kindly use pip install scikit-learn (the package is named scikit-learn on PyPI, even though it is imported as sklearn).

We shall work on the iris dataset to get more hands-on experience with scikit-learn.

You will learn all the necessary jargon and what it means as we go, I promise.

Imagine there is a passionate botanist who loves collecting iris flowers.

So she goes into the wild every morning, and as she strolls she collects all the irises she can.

Then when she gets back home she measures each petal’s length and width, and also each sepal’s length and width.

Based on these four measurements she classifies the collected flowers into one of three species: Setosa, Versicolor or Virginica.

This means that, given a set of measurements, our botanist can tell with certainty which species the flower belongs to.

Let’s assume that only these three classes are available in the wild.

Our mission is to build a machine learning model that can correctly tell us the species of the flower based on the measurements.

Since we have measurements for which the correct species is already known, and the output will always be one of the three classes of irises, this is a supervised learning problem.

Additionally, it is a classification problem, as we are classifying a given flower into one of the pre-defined classes (species).

More precisely this is a three-class classification problem.

Let’s meet the data:

1. Summon all the necessary libraries as below. You can find the ipynb notebook here.

Fig. Importing libraries

2. Import the iris dataset: Find the dataset at the UCI repository for Iris or in my GitHub repository’s dataset folder.

As the dataset itself doesn’t provide the column headers, we first put the column labels in a list called columns and then use the read_csv function to load the dataset along with the columns just declared.
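To make the loading step concrete, here is a minimal sketch. The column names are my own assumption (the raw file has no header), and a handful of rows in the UCI layout stand in for the real file so the snippet runs without a download.

```python
import io

import pandas as pd

# Assumed column labels; the raw iris file ships without a header row.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# A few rows in the same layout as the UCI file, used here as a stand-in.
# Replace `sample_file` with the path to your local copy of the dataset.
sample_file = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)

# header=None tells pandas there is no header row; names supplies our labels.
iris = pd.read_csv(sample_file, header=None, names=columns)
print(iris.head())
```

Without header=None, pandas would treat the first flower as the header row and silently drop it from the data.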

Fig. Loading the dataset

3. Understanding the data: Okay, so we have loaded the dataset as measurements for petals and sepals plus a target class.

But for a layman, what are petals and sepals?

Fig. IRIS classes

Sepals are the leaves protecting the flower bud and keeping the petals in place.

Petals, as most of us know, are the modified leaves that form a flower and surround its reproductive parts.

As for the iris flower, these measurements determine the species it belongs to.

Fig. Shape means dimensions of dataset

The shape attribute gives the number of rows and columns in our dataset.

It returns a tuple (x, y), where x is the number of rows and y is the number of columns.
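A quick sketch of checking the dimensions. As an assumption, it uses scikit-learn's bundled copy of the iris data (load_iris) instead of the CSV file, so the column names differ from the ones in this article.

```python
from sklearn.datasets import load_iris

# Stand-in for the CSV: scikit-learn's bundled iris data as a DataFrame
# (four measurement columns plus a numeric "target" column).
iris = load_iris(as_frame=True).frame

# shape is an attribute, so it is read without parentheses.
rows, cols = iris.shape
print(rows, cols)
```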

Fig. describe returns the summary of numeric columns

The describe() method gives us a summary of the numerical variables.

Here we can see the count of each column is 150, i.e. there are no nulls.

Then we have the mean, std, min, max and the quartiles for each column.
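The summary can be reproduced roughly like this. Again, load_iris is only a stand-in I am assuming for the CSV, and the isnull() check is an extra way to confirm that no values are missing.

```python
from sklearn.datasets import load_iris

# Only the four measurement columns, without the target.
measurements = load_iris(as_frame=True).data

# count, mean, std, min, quartiles and max for every numeric column.
summary = measurements.describe()
print(summary)

# Number of missing values per column; all zeros means no nulls.
missing = measurements.isnull().sum()
print(missing)
```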


Data visualization: Finally we use what we learned earlier!

4. Univariate analysis to better understand each variable.

Here we create a boxplot of all the measurements.
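One possible way to draw such a boxplot is pandas' built-in plotting, sketched below. The data source (load_iris) and the figure size are my assumptions, not necessarily what the notebook used.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

measurements = load_iris(as_frame=True).data

# One box per measurement column. Points drawn individually beyond the
# whiskers are the outliers.
ax = measurements.plot(kind="box", figsize=(8, 5))
ax.set_ylabel("cm")
plt.tight_layout()
```

In a notebook the figure renders inline; in a plain script you would save it with plt.savefig() or switch to an interactive backend and call plt.show().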



Fig. Boxplot

We can see that the sepal length ranges from 4 to 8.0 units with no outliers, while sepal width does have a few outliers, exactly three.

There are no outliers for petal length and width.



5. Multivariate analysis to check the relationship between variables.

We use the seaborn package for this analysis.

Seaborn is another superb plotting package, built on top of matplotlib.

The pairplot() function plots all 4 measurements against each other so that we can look for relationships between them.

Each measurement is compared with every other measurement and with itself.



Fig. Pairplot

From the given chart, we can clearly see that the measurements of each species are concentrated in a specific range of values.

Hence, a classification algorithm should be able to tell the iris species apart.


Splitting Data: Before we build a machine learning model on these measurements, we need some way to confirm that whatever the model outputs as the “class” of the flower is correct.

But logically, we cannot test the model on the same data we used to train it.

We need fresh, unseen data for which we already know the true output class.

We run our model over this fresh data and get the predicted class.

If the predicted class and the stored class are the same, voilà!

For this purpose, a common rule of thumb in machine learning is to train the model on 75% of your data, calling it the training data or training set.

The remaining 25% of the data is the test data, test set or hold-out set.

Scikit-learn provides a function, train_test_split, to handle the splitting. It takes two important arguments: the input X columns and the output y variable.

The default test_size is 0.25, that is, 25% of the whole data becomes test data. The function shuffles the rows before splitting, and we passed random_state = 0 to fix the seed of that shuffle so that we get the same split on every run.

Suppose we took the last 25% of our data as test data: since the dataset is ordered by species, we would have only Iris-virginica outputs.

That would be highly misleading for both training and testing, hence the random splitting.

P.S.: It's just a convention to use a capital X for the input variables and a lowercase y for the outcome variable.
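Here is a small sketch of the split. It assumes scikit-learn's bundled iris data as X and y rather than the CSV columns, and it also shows why simply slicing off the end of the data would be a bad idea.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X holds the four measurements per flower, y the species as 0, 1 or 2.
X, y = load_iris(return_X_y=True)

# The rows are ordered by species, so the last quarter is a single class.
last_quarter = y[-38:]
print(set(last_quarter.tolist()))

# Default test_size is 0.25; random_state=0 makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
```

With 150 rows this gives 112 training rows and 38 test rows.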


Building the model: Now we move to the final and most interesting step, which is to build a machine learning model that learns from these values.

For this dataset, we will use the KNN classification model.

KNN Algorithm: KNN stands for K-nearest neighbors.

Have you ever heard the saying that you are the average of the five people you spend most of your time with? Apply this analogy to understand the KNN algorithm.

KNN looks at the nearest neighbors of a given data point and assigns the point the class that most of those neighbors belong to.

The ‘K’ in this algorithm is the number of neighbors that we consider; for example, k=3 means we use the 3 nearest neighbors.
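To see the voting idea without any library, here is a toy from-scratch sketch. The function name knn_predict and the five made-up points are purely illustrative; this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and take a majority vote.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2D points: two near (1, 1) labelled "a", three near (5, 5) labelled "b".
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
```

For the query (1.1, 1.0), the three nearest points are two "a" points and one "b" point, so the vote goes to "a".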

The KNN algorithm lives in the sklearn.neighbors module as the KNeighborsClassifier class.

To use this algorithm, we need to instantiate an object from that class.

We call this object ‘knn’.

For simplicity, we use just one neighbor.

Don’t worry, we will come back to this choice later.

Using the fit() method, we train the model on our training set.

The fit() method takes two arguments: X_train, which contains all the measurements, and y_train, which contains the labels for those measurements.
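Instantiating and training the classifier might look like the sketch below. The data comes from load_iris as an assumed stand-in for the CSV, split with random_state=0.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_neighbors=1: each prediction copies the label of the single closest
# training flower.
knn = KNeighborsClassifier(n_neighbors=1)

# For KNN, fitting essentially stores the training set for later lookups.
knn.fit(X_train, y_train)
print(knn.classes_)
```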

But training the model isn’t enough, we need to test it!

We apply the trained knn model to the test data, X_test.

The score() method calculates the mean accuracy on the given test data.

It takes the test data points and their labels and outputs the accuracy.

For our model, we got an accuracy of a whopping 97%.
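The evaluation step can be sketched as follows, under the same assumptions as before (bundled iris data, random_state=0, one neighbor). The exact number you see may differ slightly if your data file or split differs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Predict a species for every flower the model has not seen.
predictions = knn.predict(X_test)

# score() is the fraction of test flowers whose prediction matches the label.
accuracy = knn.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

score() is simply a shortcut for comparing predictions with y_test and averaging the matches.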

Pat your back, fellas! You deployed your first machine learning model.

Conclusion: First of all, congratulations on your first ever machine learning model! A couple of months ago I patted my own back for the same reason.

With this piece, I conclude the Python for Data Science series.

This much Python is enough for you to sail the rough sea on your own.

I would like to thank the diligent reader. My motive behind these publications is to grow by sharing whatever feeble knowledge I have.

I believe even if one single person is able to learn a single topic from my publications, I’ve given something back to the community.

That’s my biggest achievement! ✌

You can find the earlier publications here: Python For Data Science Part I, Python For Data Science Part II.

I’m glad to announce the next series, Machine Learning using Python. This series will cover the sklearn package in depth, and we will work on real-life datasets to solve interesting questions using subtle algorithms.

So meet me then, learners! Follow me on LinkedIn.

