# The Complete Guide to Classification in Python

Let’s get started!Regression Versus Classification ProblemsPreviously, we saw that linear regression assumes the response variable is quantitative.

However, in many situations, the response is actually qualitative, like the color of the eyes.

This type of response is known as categorical.

Classification is the process of predicting a qualitative response.

Methods used for classification often predict the probability of each of the categories of a qualitative variable as the basis for making the classification.

In a certain way, they behave like regression methods.

With classification, we can answer questions like:A person has a set of symptoms that could be attributed to one of three medical conditions.

Which one?Is a transaction fraudulent or not?Categorical responses are often expressed as words.

Of course, we cannot use words as input data for traditional statistical methods.

We will see how to deal with that when we get to implement the algorithms.

For now, let’s see how logistic regression works.

Logistic RegressionWhen it comes to classification, we are determining the probability of an observation to be part of a certain class or not.

Therefore, we wish to express the probability with a value between 0 and 1.

A probability close to 1 means the observation is very likely to be part of that category.

In order to generate values between 0 and 1, we express the probability using this equation:Sigmoid functionThe equation above is defined as the sigmoid function.

Plot this equation and you will see that this equation always results in a S-shaped curve bound between 0 and 1.

Logistic regression curveAfter some manipulation to equation above, you find that:Take the log on both sides:The equation above is known as the logit.

As you can see, it is linear in X.

Here, if the coefficients are positive, then an increase in X will result in a higher probability.

Estimating the coefficientsAs in linear regression, we need a way to estimate the coefficients.

For that, we maximize the likelihood function:Likelihood functionThe intuition here is that we want coefficients such that the predicted probability (denoted with an apostrophe in the equation above) is as close as possible to the observed state.

Similarly to linear regression, we use the p-value to determine if the null hypothesis is rejected or not.

The Z-statistic is also widely used.

A large absolute Z-statistic means that the null hypothesis is rejected.

Remember that the null hypothesis states: there is not correlation between the features and the target.

Multiple logistic regressionOf course, logistic regression can easily be extended to accommodate more than one predictor:Multiple logistic regressionNote that using multiple logistic regression might give better results, because it can take into account correlations among predictors, a phenomenon known as confounding.

Also, rarely will only one predictor be sufficient to make an accurate model for prediction.

Linear Discriminant AnalysisNow, we understand how logistic regression works, but like any model, it presents some flaws:When classes are well separated, parameters estimate from logistic regression tend to be unstableWhen the data set is small, logistic regression is also unstableNot the best to predict more than two classesThat’s where linear discriminant analysis (LDA) comes in handy.

It is more stable than logistic regression and widely used to predict more than two classes.

The particularity of LDA is that it models the distribution of predictors separately in each of the response classes, and then it uses Bayes’ theorem to estimate the probability.

Alright, that’s a bit hard to understand.

Let’s break it down.

Bayes’ theorem for classification(Sorry, Medium doesn’t support math equations.

I tried my best to be as explicit as possible).

Suppose we want to classify an observation into one of K classes, where K is greater than or equal to 2.

Then, let pi-k be the overall probability that an observation is associated to the kth class.

Then, let f_k(X) denote the density function of X for an observation that comes from the kth class.

This means that f_k(X) is large if the probability that an observation from the kth class has X = x.

Then, Bayes’ theorem states:Bayes’ theorem for classificationThe equation above can simply be abbreviated to:Abbreviated Bayes’ theorem for classificationHopefully, this makes some sense!The challenge here is to estimate the density function.

Theoretically, Bayes’ classification has the lowest error rate.

Therefore, our classifier needs to estimate the density function such as to approach the Bayes’ classifier.

LDA for one predictorSuppose we only have one predictor and that the density function normal.

Then, you can express the density function as:Normal distribution functionNow, we want to assign an observation X = x for which the P_k(X) is the largest.

If you plug in the density function in P_k(X) and take the log, you find that you wish to maximize:Discriminant equationThe equation above is called the discriminant.

As you can see, it is a linear equation.

Hence the name: linear discriminant analysis!Now, assuming only two classes with equal distributions, you find:Boundary equationThis is the boundary equation.

A graphical representation is shown hereunder.

Boundary line to separate 2 classes using LDAOf course, this represents an ideal solution.

In reality, we cannot exactly calculate the boundary line.

Therefore, LDA makes use of the following approximation:For the average of all training observationsAverage of all training observationsFor the weighted average of sample variances for each classWeighted average of sample variances for each classWhere n is the number of observations.

It is important to know that LDA assumes a normal distribution for each class, a class-specific mean, and a common variance.

LDA for more than one predictorExtending now for multiple predictors, we must assume that X is drawn from a multivariate Gaussian distribution, with a class-specific mean vector, and a common covariance matrix.

An example of a correlated and uncorrelated Gaussian distribution is shown below.

Left: Uncorrelated normal distribution.

Right: correlated normal distributionNow, expressing the discriminant equation using vector notation, we get:Discriminant equation with matrix notationAs you can see, the equation remains the same.

Only this time, we are using vector notation to accommodate many predictors.

How to assess the performance of the modelWith classification, it is sometimes irrelevant to use accuracy to assess the performance of a model.

Consider analyzing a highly imbalanced data set.

For example, you are trying to determine if a transaction is fraudulent or not, but only 0.

5% of your data set contains a fraudulent transaction.

Then, you could predict that none of the transactions will be fraudulent, and have a 99.

5% accuracy score!.Of course, this is a very naive approach that does not help detect fraudulent transactions.

So what do we use?Usually, we use sensitivity and specificity.

Sensitivity is the true positive rate: the proportions of actual positives correctly identified.

Specificity is the true negative rate: the proportion of actual negatives correctly identified.

Let’s give some context to better understand.

Using the fraud detection problem, the sensitivity is the proportion of fraudulent transactions identified as fraudulent.

The specificity is the proportion of non-fraudulent transactions identified as non-fraudulent.

Therefore, in an ideal situation, we want both a high sensitivity and specificity, although that might change depending on the context.

For example, a bank might want to prioritize a higher sensitivity over specificity to make sure it identifies fraudulent transactions.

The ROC curve (receiver operating characteristic) is good to display the two types of error metrics described above.

The overall performance of a classifier is given by the area under the ROC curve (AUC).

Ideally, it should hug the upper left corner of the graph, and have an area close to 1.

Example of a ROC curve.

The straight line is a base modelQuadratic Discriminant Analysis (QDA)Here, we keep the same assumptions as for LDA, but now, each observation from the kth class has its own covariance matrix.

For QDA, the discriminant is expressed as:Discriminant equation for QDAWithout any surprises, you notice that the equation is now quadratic.

But, why choose QDA over LDA?QDA is a better option for large data sets, as it tends to have a lower bias and a higher variance.

On the other hand, LDA is more suitable for smaller data sets, and it has a higher bias, and a lower variance.

ProjectGreat!.Now that we deeply understand how logistic regression, LDA, and QDA work, let’s apply each algorithm to solve a classification problem.

MotivationMushrooms simply taste great!.But with over 10 000 species of mushrooms only in North America, how can we tell which are edible?This is the objective of this project.

We will build a classifier that will determine if a certain mushroom is edible or poisonous.

I suggest you grab the data set and follow along.

If you ever get stuck, feel free to consult the full notebook.

Let’s get to it!A GIF of a coding mushroom… What are the odds of finding that!Exploratory data analysisThe data set we will be using contains 8124 instances of mushrooms with 22 features.

Among them, we find the mushroom’s cap shape, cap color, gill color, veil type, etc.

Of course, it also tells us if the mushroom is edible or poisonous.

Let’s import some of the libraries that will help us import the data and manipulate it.

In your notebook, run the following code:A common first step for a data science project is to perform an exploratory data analysis (EDA).

This step usually involves learning more about the data you are working with.

You might want to know the shape of your data set (how many rows and columns), the number of empty values and visualize parts of the data to better understand the correlation between the features and the target.

Import the data and see the first five columns with the following code:It’s always good to have the data set in a data folder within the project directory.

Furthermore, we store the file path in a variable, such that if the path ever changes, we only have to change the variable assignment.

After running this code cell, you should see the first five rows.

You notice that each feature is categorical, and a letter is used to define a certain value.

Of course, the classifier cannot take letters as input, so we will have to change that eventually.

For now, let’s see if our data set is unbalanced.

An unbalanced data set is when one class is much more present than the other.

Ideally, in the context of classification, we want an equal number of instances of each class.

Otherwise, we would need to implement advanced sampling methods, like minority oversampling.

In our case, we want to see if there is an equal number of poisonous and edible mushrooms in the data set.

We can plot the frequency of each class like this:And you get the following graph:Count of each classAwesome!.It looks like a fairly balanced data set with an almost equal number of poisonous and edible mushrooms.

Now, I wanted to see how each feature affects the target.

To do so, for each feature, I made a bar plot of all possible values separated by the class of mushroom.

Doing it manually for all 22 features makes no sense, so we build this helper function:The hue will give a color code to the poisonous and edible class.

The data parameter will contain all features but the mushroom’s class.

Running the cell code below:You should get a list of 22 plots.

Here’s an example of the output:Cap surfaceTake some time to look through all the plots.

Now, let’s see if we have any missing values.

Run this piece of code:And you should see each column with the number of missing values.

Luckily, we have a data set with no missing values.

This is very uncommon, but we won’t complain.

Getting ready for modellingNow that we are familiar with the data, it is time to get it ready for modelling.

As mentioned before, the features have letters to represent the different possible values, but we need to turn them into numbers.

To achieve that, we will use label encoding and one-hot encoding.

Let’s first use label encoding on the target column.

Run the following code:And you notice now that the column now contains 1 and 0.

Result of label encoding the ‘class’ columnNow, poisonous is represented by 1 and edible is represented by 0.

Now, we can think of our classifier as “poisonous or not”.

A poisonous mushroom gets a 1 (true), and an edible mushroom gets a 0 (false).

Hence, label encoding will turn a categorical feature into numerical.

However, it is not recommended to use label encoding when there are more than two possible values.

Why?Because it will then assign each value to either 0, 1 or 2.

This is a problem, because the “2” could be considered as being more important and false correlations could be drawn from that.

To avoid this problem, we use one-hot encoding on the other features.

To understand what it does, let’s consider the cap shape of the first entry point.

You see it has a value of “x”, which stands for a convex cap shape.

However, there is a total of six different cap shapes recorded in the data set.

If we one-hot encode the feature, we should get:One-hot encoding the “cap-shape” featureAs you can see, the cap shape is now a vector.

A 1 denotes the actual cap shape value for an entry in the data set, and the rest is filled with 0.

Again, you can think of 1 as true and 0 as false.

The drawback of one-hot encoding is that it introduces more columns to the data set.

In the case of cap shape, we go from one column to six columns.

For very large data sets, this might be a problem, but in our case, the additional columns should be manageable.

Let’s go ahead and one-hot encode the rest of the features:And you should now see:One-hot encoded data setYou notice that we went from 23 columns to 118.

It is a five fold increase, but the number is not high enough to cause computer memory issues.

Now that our data set contains only numerical data, we are ready to start modelling and making predictions!Train/test splitBefore diving deep into modelling and making predictions, we need to split our data set into a training set and test set.

That way, we can train an algorithm on the training set, and make predictions on the test set.

The error metrics will be much more relevant this way, since the algorithm will make predictions on data it has not seen before.

We can easily split the data set like so:Here, y is simply the target (poisonous or edible).

Then, X is simply all features of the data set.

Finally, we use the train_test_split function.

The test_size parameter corresponds to the fraction of the data set that will be used for testing.

Usually, we use 20%.

Then, the random_state parameter is used for reproducibility.

It can be set to any number, but it will ensure that every time the code runs, the data set will be split identically.

If no random_state is provided, then the train and test set will differ, since the function splits it randomly.

All right, we are officially ready to start modelling and making predictions!Logistic regressionWe will first use logistic regression.

Throughout the following steps, we will use the area under the ROC curve and a confusion matrix as error metrics.

Let’s import all we need first:Then, we make an instance of the LogisticRegression object and fit the model to the training set:Then, we predict the probability that a mushroom is poisonous.

Remember, we treat the mushrooms as being poisonous or non-poisonous.

Also, you must be reminded that logistic regression returns a probability.

For now, let’s set the threshold to 0.

5 That way, if the probability is greater than 0.

5, a mushroom will be classified as poisonous.

Of course, if the probability is less than the threshold, the mushroom is classified as edible.

This is exactly what is happening in the code cell below:Notice that we calculated the probabilities on the test set.

Now, let’s see the confusion matrix.

This will show us the true positive, true negative, false positive and false negative rates.

Example of a confusion matrixWe output our confusion matrix like so:And you should get:Amazing!.Our classifier is perfect!.From the confusion matrix above, you see that our false positive and false negative rates are 0, meaning that all mushrooms were correctly classified as poisonous or edible!Let’s print the area under the ROC curve.

As you know, for a perfect classifier, it should be equal to 1.

Indeed, the code block above outputs 1!.We can make our own function to visualize the ROC curve:And you should see:ROC curveCongratulations!.You built a perfect classifier with a basic logistic regression model.

Still, to gain more experience, let’s build a classifier using LDA and QDA, and see if we get similar results.

Classifier with LDAFollowing the same steps outlined for logistic regression:If you run the code above, you should see that we get a perfect classifier again, with identical results to the classifier using logistic regression.

Classifier with QDANow, we repeat the process, but using QDA:And again, the results are the same!In this article, you learned about the inner workings of logistic regression, LDA and QDA for classification.

You also learned how to implement each algorithm in Python to solve a classification problem.

You went through a typical workflow for classification, where classes must be encoded, and where you must check if your dataset is unbalanced.