Best Resources for Imbalanced Classification

Classification is a predictive modeling problem that involves predicting a class label for a given example.

It is generally assumed that the distribution of examples in the training dataset is even across all of the classes.

In practice, this is rarely the case.

Those classification predictive modeling problems where the distribution of examples across class labels is not equal (e.g. are skewed) are called “imbalanced classification.”

Typically, a slight imbalance is not a problem and standard machine learning techniques can be used.

In those cases where the imbalance is severe, such as a 1:100, 1:1000, or higher ratio of the minority to the majority class, then specialized techniques are required.
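To make ratios like these concrete, the following minimal sketch summarizes the class distribution of a label list using only the Python standard library. The labels are hypothetical and constructed to give a 1:100 ratio; with a real dataset you would pass in the target column instead.

```python
from collections import Counter

# Hypothetical labels for illustration: 0 = majority class, 1 = minority class.
labels = [0] * 10000 + [1] * 100

counts = Counter(labels)
total = sum(counts.values())

# Report the count and share of the dataset for each class.
for label, count in sorted(counts.items()):
    print(f"class={label} count={count} share={count / total:.1%}")

# Express the skew as a minority-to-majority ratio.
minority = min(counts.values())
majority = max(counts.values())
ratio = majority / minority
print(f"imbalance ratio = 1:{ratio:.0f}")
```

Checking the distribution this way is a cheap first step before deciding whether specialized techniques are likely to be needed.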

The reason why specialized techniques are required for classification problems with a severe imbalance in the classes is that most machine learning models used for classification were designed and tested around the assumption that the class distribution is equal.

As such, they often fail or result in misleading results.
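One common way results become misleading is through classification accuracy. The sketch below uses hypothetical labels with a 1:100 ratio and a "model" that always predicts the majority class; it is an illustration of the effect, not a real classifier.

```python
# Hypothetical test set with a 1:100 class ratio: 0 = majority, 1 = minority.
y_true = [0] * 1000 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of true minority examples that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.3f} minority recall={recall:.3f}")
```

The accuracy is about 99 percent even though not a single minority example is detected, which is why accuracy alone can hide a model that has failed on the class of interest.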

In this tutorial, you will discover the best resources that you can use to get started with imbalanced classification.

After completing this tutorial, you will know:

Let’s get started.

Best Resources for Imbalanced Classification
Photo by Radek Kucharski, some rights reserved.

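This tutorial is divided into three parts; they are:

- Books on imbalanced classification
- Papers on imbalanced classification
- Libraries for imbalanced classification

Addressing imbalanced classification predictive modeling problems with machine learning is a relatively new area of study.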

Nevertheless, given the pervasiveness of imbalanced classification datasets, a few books and book chapters are available on the topic.

In this section, we will take a closer look at the following books on imbalanced classification for machine learning:

- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Learning from Imbalanced Data Sets, 2018.

I will also include the following book that features a dedicated chapter on the topic:

- Applied Predictive Modeling, 2013.

There are two other books I found that are related, but perhaps more tangentially, and I won’t cover them in more detail; they were:

Let’s take a closer look at the books.

The first book, “Imbalanced Learning: Foundations, Algorithms, and Applications,” is a collection of papers that form chapters, edited by two academics who have written a lot on the topic: Haibo He and Yunqian Ma.

The book was published in 2013.

Imbalanced Learning – Foundations, Algorithms, and Applications

The book is designed to bring a postgraduate student or academic up to speed with the field of imbalanced learning.

This is a more general field than imbalanced classification, as it includes other problem types where the training dataset may be imbalanced, such as regression and clustering.

Specifically, we define imbalanced learning as the learning process for data representation and information extraction with severe data distribution skews to develop effective decision boundaries to support the decision-making process.

The learning process could involve supervised learning, unsupervised learning, semi-supervised learning, or a combination of two or all of them.

The task of imbalanced learning could also be applied to regression, classification, or clustering tasks.

— Pages 1-2, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

It provides an excellent starting point for a practitioner to get an overview of the field and the techniques.

The table of contents for this book is listed below.

Learn more about the book here.

The second book, “Learning from Imbalanced Data Sets,” is also a collection of papers on the topic of machine learning for imbalanced datasets, although it feels more cohesive than the previous book, “Imbalanced Learning.”

The book was written or edited by a long list of academics: Alberto Fernández, Salvador García, Mikel Galar, Ronaldo Prati, Bartosz Krawczyk, and Francisco Herrera. It was published in 2018.

Learning from Imbalanced Data Sets

Similar to the previous book, this book is designed to bring postgraduate students and engineers up to speed with the field of machine learning for imbalanced datasets.

The intended audience of this book are developers and engineers aiming to apply imbalance-learning techniques to solve different kinds of real-world problems, as well as researchers and students needing a comprehensive review on techniques, methodologies, and tools for learning from imbalanced data.

— Page viii, Learning from Imbalanced Data Sets, 2018.

The book reads as being more systematic (e.g. working through a project end-to-end) and practical than the previous book, which read as more academic (pet methods or subfields).

I would recommend buying both together if you have the budget.

The table of contents for this book is listed below.

Learn more about the book here.

“Applied Predictive Modeling” is one of my favorite handbooks for applied machine learning, written by Max Kuhn and Kjell Johnson and focused on R.

The book was published in 2013, but the general advice is probably timeless.

Applied Predictive Modeling

Although the whole book is a great read, it has one chapter dedicated to the problem of imbalanced classification.

The approach to the chapter is a case study on a “Caravan Policy Ownership” dataset.

The authors work through this problem to demonstrate a suite of different practical techniques for handling a severe class imbalance.

This chapter is required reading for a practical demonstration of how to work through a real-world imbalanced dataset using modern methods.

The sections of this chapter are as follows:

Learn more about the book here.

There are thousands of publications on machine learning methods for imbalanced classification and related problems and techniques.

Instead of enumerating the best papers in the field, in this section, we will take a look at some of the best survey papers.

A survey paper gives a broad overview of a field, positions the techniques within it, and shows how they relate to each other.

They are designed to help newcomers to the field, such as postgraduate students and engineers, get up-to-speed rapidly.

As a practitioner, reading a survey paper may be more efficient than skimming books on the topic.

There are many great survey papers to choose from; my recommended favorites are as follows:

I also recommend study papers, papers that demonstrate one or more standard techniques against a suite of standard machine learning datasets.

In this case, the techniques are designed to address the imbalanced class distribution and the standard datasets have a skewed class distribution.

These papers quickly flush out what methods work (or are popular) and what datasets are useful as benchmarks.

Some examples of good papers of this type include:

Python has rapidly become the preferred programming language for applied machine learning.

The go-to library for machine learning in Python is scikit-learn, which provides data preparation, machine learning algorithms, and model evaluation schemes, among other techniques.

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.

This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language.

— Scikit-learn: Machine Learning in Python, 2011.

Although not designed around the problem of imbalanced classification, the scikit-learn library does provide some tools for handling imbalanced datasets, such as:

A project related to scikit-learn dedicated to the problem of imbalanced classification is called imbalanced-learn.

It provides techniques that can be used for imbalanced classification in conjunction with the scikit-learn library, allowing learning algorithms and model evaluation techniques to be shared between the libraries.
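As a rough illustration of the scikit-learn side of this pairing, the sketch below fits a logistic regression with and without the `class_weight="balanced"` option and reports F1 and ROC AUC rather than accuracy. The synthetic dataset, the choice of model, and the split sizes are assumptions made for the example, not recommendations from any of the resources above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with roughly 1% minority examples (illustrative only).
X, y = make_classification(n_samples=10000, n_features=10, weights=[0.99],
                           flip_y=0, random_state=1)

# Stratify so that both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1)

results = {}
for class_weight in (None, "balanced"):
    # "balanced" weights each class inversely to its frequency in y_train.
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    f1 = f1_score(y_test, y_pred, zero_division=0)
    auc = roc_auc_score(y_test, y_prob)
    results[class_weight] = (f1, auc)
    print(f"class_weight={class_weight}: F1={f1:.3f} ROC AUC={auc:.3f}")
```

Which setting scores better depends on the data and the metric, so it is worth comparing both on your own problem rather than assuming one will win.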

imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern recognition.

— Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning, 2016.

The library focuses on providing oversampling and undersampling techniques to make the class distribution more equal in a training dataset prior to fitting a given machine learning model.
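The idea behind the simplest oversampling approach, random oversampling, can be sketched in plain Python. The function below is a hypothetical illustration written for this example and does not use the imbalanced-learn API; the library offers its own implementations of this and more sophisticated methods.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sample with replacement to make up the shortfall for this class.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy training set: 95 majority rows and 5 minority rows.
X_train = [[float(i)] for i in range(100)]
y_train = [0] * 95 + [1] * 5

X_res, y_res = random_oversample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))
```

Resampling of this kind is applied to the training data only; the test set should keep its original class distribution so that evaluation reflects the real problem.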

For more on imbalanced-learn, see:

In this tutorial, you discovered the best resources that you can use to get started with imbalanced classification.

Specifically, you learned:

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.
