Imbalanced Classification With Python (7-Day Mini-Course)

Classification predictive modeling is the task of assigning a label to an example.

Imbalanced classification refers to those classification tasks where the distribution of examples across the classes is not equal.

Practical imbalanced classification requires a suite of specialized techniques, including data preparation methods, learning algorithms, and performance metrics.

In this crash course, you will discover how you can get started and confidently work through an imbalanced classification project with Python in seven days.

This is a big and important post.

You might want to bookmark it.

Let’s get started.


Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning.

Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as knowing your way around basic Python and having some familiarity with libraries like NumPy and scikit-learn. You do NOT need to be a machine learning expert. This crash course will take you from a developer who knows a little machine learning to a developer who can navigate an imbalanced classification project.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed.

If you need help with your environment, you can follow the step-by-step tutorial here.

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore).

It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with imbalanced classification in Python:

Lesson 01: The challenge of imbalanced classification.
Lesson 02: Intuition for skewed class distributions.
Lesson 03: Evaluating imbalanced classification models.
Lesson 04: Undersampling the majority class.
Lesson 05: Oversampling the minority class.
Lesson 06: Combining data undersampling and oversampling.
Lesson 07: Cost-sensitive algorithms.

Each lesson could take you 60 seconds or up to 30 minutes.

Take your time and complete the lessons at your own pace.

Ask questions and even post results in the comments below.

The lessons might expect you to go off and find out how to do things.

I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help about the algorithms and the best-of-breed tools in Python.

(Hint: I have all of the answers directly on this blog; use the search box.)

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course.

For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Imbalanced Classification with Python.”

In this lesson, you will discover the challenge of imbalanced classification problems.

Imbalanced classification problems pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class.

This results in models that have poor predictive performance, specifically for the minority class.

This is a problem because typically, the minority class is more important and therefore the problem is more sensitive to classification errors for the minority class than the majority class.

A classification problem may be only slightly skewed, with a small difference between the number of examples in each class.

Alternately, the classification problem may have a severe imbalance where there might be hundreds or thousands of examples in one class and tens of examples in another class for a given training dataset.

Many of the classification predictive modeling problems that we are interested in solving in practice are imbalanced.

As such, it is surprising that imbalanced classification does not get more attention.

For this lesson, you must list five general examples of problems that inherently have a class imbalance.

One example might be fraud detection; another might be intrusion detection.

Post your answer in the comments below.

I would love to see what you come up with.

In the next lesson, you will discover how to develop an intuition for skewed class distributions.

In this lesson, you will discover how to develop a practical intuition for imbalanced classification datasets.

A challenge for beginners working with imbalanced classification problems is understanding what a specific skewed class distribution means in practice.

For example, what is the difference and implication of a 1:10 vs. a 1:100 class ratio?

The make_classification() scikit-learn function can be used to define a synthetic dataset with a desired class imbalance. The “weights” argument specifies the proportion of examples assigned to each class; e.g. [0.99, 0.01] means that 99 percent of the examples will belong to the majority class, and the remaining 1 percent will belong to the minority class.

Once defined, we can summarize the class distribution using a Counter object to get an idea of exactly how many examples belong to each class.

We can also create a scatter plot of the dataset because there are only two input variables.

The dots can then be colored by each class.

This plot provides a visual intuition for what exactly a 99 percent vs. 1 percent majority/minority class imbalance looks like in practice.

The complete example of creating and summarizing an imbalanced classification dataset is listed below.
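A minimal sketch of this example is given below, assuming scikit-learn and matplotlib are installed; the specific parameters (10,000 samples, two input features, a fixed random seed) are illustrative choices rather than required values.

```python
from collections import Counter
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot

# define a synthetic binary classification dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99, 0.01],
                           flip_y=0, random_state=1)

# summarize the number of examples in each class
counter = Counter(y)
print(counter)

# scatter plot of the two input variables, colored by class label
for label in counter:
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```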

For this lesson, you must run the example and review the plot.

For bonus points, you can test different class ratios and review the results.

Post your answer in the comments below.

I would love to see what you come up with.

In the next lesson, you will discover how to evaluate models for imbalanced classification.

In this lesson, you will discover how to evaluate models on imbalanced classification problems.

Prediction accuracy is the most common metric for classification tasks, although it is inappropriate and potentially dangerously misleading when used on imbalanced classification tasks.

The reason is that if 98 percent of the data belongs to the negative class, you can achieve 98 percent accuracy on average by simply predicting the negative class all the time, achieving a score that naively looks good but in practice has no skill.

Instead, alternate performance metrics must be adopted.

Popular alternatives are the precision and recall scores that allow the performance of the model to be considered by focusing on the minority class, called the positive class.

Precision calculates the ratio of the number of correctly predicted positive examples divided by the total number of positive examples that were predicted. Maximizing precision will minimize the number of false positives.

Recall calculates the ratio of the number of correctly predicted positive examples divided by the total number of positive examples that could have been predicted. Maximizing recall will minimize the number of false negatives.

The performance of a model can be summarized by a single score that combines both the precision and the recall, called the F-measure (the harmonic mean of precision and recall).

Maximizing the F-Measure will maximize both the precision and recall at the same time.

The example below fits a logistic regression model on an imbalanced classification problem and calculates the accuracy, which can then be compared to the precision, recall, and F-measure.
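A sketch of this example is shown below, assuming scikit-learn is installed; the dataset parameters and the 50/50 stratified train/test split are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# define a synthetic dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99, 0.01], random_state=1)

# split into train and test sets, preserving the class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    stratify=y, random_state=2)

# fit a logistic regression model and predict on the test set
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
yhat = model.predict(X_test)

# compare accuracy to metrics that focus on the minority (positive) class
print('Accuracy:  %.3f' % accuracy_score(y_test, yhat))
print('Precision: %.3f' % precision_score(y_test, yhat))
print('Recall:    %.3f' % recall_score(y_test, yhat))
print('F-measure: %.3f' % f1_score(y_test, yhat))
```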

For this lesson, you must run the example and compare the classification accuracy to the other metrics, such as precision, recall, and F-measure.

For bonus points, try other metrics such as Fbeta-measure and ROC AUC scores.

Post your answer in the comments below.

I would love to see what you come up with.

In the next lesson, you will discover how to undersample the majority class.

In this lesson, you will discover how to undersample the majority class in the training dataset.

A simple approach to using standard machine learning algorithms on an imbalanced dataset is to change the training dataset to have a more balanced class distribution.

This can be achieved by deleting examples from the majority class, referred to as “undersampling.” A possible downside is that examples from the majority class that are helpful during modeling may be deleted.

The imbalanced-learn library provides implementations of many undersampling algorithms. This library can be installed easily using pip (for example, pip install imbalanced-learn).

A fast and reliable approach is to randomly delete examples from the majority class, either to reduce the imbalance to a less severe ratio or to make the classes even.

The example below creates a synthetic imbalanced classification dataset, then uses the RandomUnderSampler class to change the class distribution from a 1:100 minority-to-majority ratio to a less severe 1:2.
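A minimal sketch of this example is shown below, assuming the imbalanced-learn library is installed; the sampling_strategy value of 0.5 is what produces the roughly 1:2 ratio.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# define a synthetic dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99, 0.01], random_state=1)
print('Before undersampling:', Counter(y))

# randomly delete majority class examples until the ratio is about 1:2 (minority:majority)
undersample = RandomUnderSampler(sampling_strategy=0.5, random_state=1)
X_under, y_under = undersample.fit_resample(X, y)
print('After undersampling:', Counter(y_under))
```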

For this lesson, you must run the example and note the change in the class distribution before and after undersampling the majority class.

For bonus points, try other undersampling ratios or even try other undersampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below.

I would love to see what you come up with.

In the next lesson, you will discover how to oversample the minority class.

In this lesson, you will discover how to oversample the minority class in the training dataset.

An alternative to deleting examples from the majority class is to add new examples from the minority class.

This can be achieved by simply duplicating examples in the minority class, but these examples do not add any new information.

Instead, new examples from the minority class can be synthesized using existing examples in the training dataset.

These new examples will be “close” to existing examples in the feature space, but different in small but random ways.

The SMOTE algorithm is a popular approach for oversampling the minority class.

This technique can be used to reduce the imbalance or to make the class distribution even.

The example below demonstrates using the SMOTE class provided by the imbalanced-learn library on a synthetic dataset.

The initial class distribution is 1:100 and the minority class is oversampled to a 1:2 distribution.
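A sketch of this example, assuming imbalanced-learn is installed, might look like the following; as before, sampling_strategy=0.5 gives the approximate 1:2 ratio.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# define a synthetic dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99, 0.01], random_state=1)
print('Before oversampling:', Counter(y))

# synthesize new minority class examples with SMOTE until the ratio is about 1:2
oversample = SMOTE(sampling_strategy=0.5, random_state=1)
X_over, y_over = oversample.fit_resample(X, y)
print('After oversampling:', Counter(y_over))
```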

For this lesson, you must run the example and note the change in the class distribution before and after oversampling the minority class.

For bonus points, try other oversampling ratios, or even try other oversampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below.

I would love to see what you come up with.

In the next lesson, you will discover how to combine undersampling and oversampling techniques.

In this lesson, you will discover how to combine data undersampling and oversampling on a training dataset.

Data undersampling will delete examples from the majority class, whereas data oversampling will add examples to the minority class.

These two approaches can be combined and used on a single training dataset.

Given that there are so many different data sampling techniques to choose from, it can be confusing as to which methods to combine.

Thankfully, there are common combinations that have been shown to work well in practice, such as SMOTE oversampling paired with random undersampling, Tomek Links undersampling, or Edited Nearest Neighbors undersampling of the majority class. These combinations can be applied manually to a given training dataset by first applying one sampling algorithm, then another.

Thankfully, the imbalanced-learn library provides implementations of common combined data sampling techniques.

The example below demonstrates how to use the SMOTEENN class, which combines both SMOTE oversampling of the minority class and Edited Nearest Neighbors undersampling of the majority class.
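A minimal sketch of this example, assuming imbalanced-learn is installed, is shown below; with default settings, SMOTEENN first oversamples the minority class and then removes examples flagged by Edited Nearest Neighbors.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# define a synthetic dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99, 0.01], random_state=1)
print('Before resampling:', Counter(y))

# apply SMOTE oversampling followed by Edited Nearest Neighbors undersampling
resample = SMOTEENN(random_state=1)
X_res, y_res = resample.fit_resample(X, y)
print('After resampling:', Counter(y_res))
```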

For this lesson, you must run the example and note the change in the class distribution before and after the data sampling.

For bonus points, try other combined data sampling techniques or even try manually applying oversampling followed by undersampling on the dataset.

Post your answer in the comments below.

I would love to see what you come up with.

In the next lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

In this lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Most machine learning algorithms assume that all misclassification errors made by a model are equal.

This is often not the case for imbalanced classification problems, where missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model.

Many machine learning algorithms can be updated to be cost-sensitive, where the model is penalized for misclassification errors from one class more than the other, such as the minority class.

The scikit-learn library provides this capability for a range of algorithms via the class_weight argument specified when defining the model.

A weighting can be specified that is inversely proportional to the class distribution.

If the class distribution was 0.99 to 0.01 for the majority and minority classes, then the class_weight argument could be defined as a dictionary that specifies a penalty of 0.01 for errors made on the majority class and a penalty of 0.99 for errors made on the minority class, e.g. {0:0.01, 1:0.99}.

This is a useful heuristic and can be configured automatically by setting the class_weight argument to the string ‘balanced‘.

The example below demonstrates how to define and fit a cost-sensitive logistic regression model on an imbalanced classification dataset.
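A sketch of this example is shown below, assuming scikit-learn is installed; the use of repeated stratified cross-validation and ROC AUC as the evaluation metric are illustrative choices, not the only way to assess the model.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

# define a synthetic dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99, 0.01], random_state=1)

# define a cost-sensitive logistic regression model using balanced class weights
model = LogisticRegression(solver='lbfgs', class_weight='balanced')

# evaluate with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```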

For this lesson, you must run the example and review the performance of the cost-sensitive model.

For bonus points, compare the performance to the cost-insensitive version of logistic regression.

Post your answer in the comments below.

I would love to see what you come up with.

This was the final lesson of the mini-course.

You made it.

Well done! Take a moment and look back at how far you have come.

You discovered how imbalanced classification differs from standard classification, how to develop an intuition for skewed class distributions, how to evaluate models with appropriate metrics, how to undersample the majority class, oversample the minority class, and combine the two, and how to use cost-sensitive algorithms.

Take the next step and check out my book on Imbalanced Classification with Python.

How did you do with the mini-course? Did you enjoy this crash course? Do you have any questions? Were there any sticking points? Let me know.

Leave a comment below.
