Logistic regression does not support imbalanced classification directly.
Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed distribution into account.
This can be achieved by specifying a class weighting configuration that is used to influence the amount that logistic regression coefficients are updated during training.
The weighting can penalize the model less for errors made on examples from the majority class and penalize the model more for errors made on examples from the minority class.
The result is a version of logistic regression that performs better on imbalanced classification tasks, generally referred to as cost-sensitive or weighted logistic regression.
In this tutorial, you will discover cost-sensitive logistic regression for imbalanced classification.
After completing this tutorial, you will know how standard logistic regression performs on an imbalanced dataset, how class weighting modifies the loss used to fit the coefficients, and how to configure and grid search class weightings with scikit-learn.
Let’s get started.
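This tutorial is divided into five parts: the imbalanced classification dataset, standard logistic regression on that dataset, weighted logistic regression, weighted logistic regression with scikit-learn, and a grid search over class weightings.
Before we dive into the modification of logistic regression for imbalanced classification, let’s first define an imbalanced classification dataset.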
We can use the make_classification() function to define a synthetic imbalanced two-class classification dataset.
We will generate 10,000 examples with an approximate 1:100 minority to majority class ratio.
Once generated, we can summarize the class distribution to confirm that the dataset was created as we expected.
Finally, we can create a scatter plot of the examples and color them by class label to help understand the challenge of classifying examples from this dataset.
Tying this together, the complete example of generating the synthetic dataset and plotting the examples is listed below.
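A minimal sketch of such an example follows. Only the 10,000 examples and the approximate 1:100 ratio come from the text; the other make_classification() arguments (two features, one cluster per class, no label noise, the random seed) and the helper name plot_classes() are assumptions made for illustration.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Generate 10,000 examples with roughly 1 minority example per 100 majority examples.
# Everything except n_samples and the class ratio is an assumed setting.
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.99], flip_y=0, random_state=2,
)

# Summarize the class distribution.
counter = Counter(y)
print(counter)


def plot_classes(X, y):
    """Scatter plot of the two input features, colored by class label (hypothetical helper)."""
    from matplotlib import pyplot  # imported lazily so the summary runs without a display

    for label in sorted(set(y)):
        mask = y == label
        pyplot.scatter(X[mask, 0], X[mask, 1], label=str(label))
    pyplot.legend()
    pyplot.show()
```

Calling plot_classes(X, y) draws the scatter plot; matplotlib's default color cycle gives blue for class 0 and orange for class 1.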
Running the example first creates the dataset and summarizes the class distribution.
We can see that the dataset has an approximate 1:100 class distribution with a little less than 10,000 examples in the majority class and 100 in the minority class.
Next, a scatter plot of the dataset is created showing the large mass of examples for the majority class (blue) and a small number of examples for the minority class (orange), with some modest class overlap.
Figure: Scatter Plot of Binary Classification Dataset With 1 to 100 Class Imbalance
Next, we can fit a standard logistic regression model on the dataset.
We will use repeated cross-validation to evaluate the model, with three repeats of 10-fold cross-validation.
The model performance will be reported using the mean ROC area under curve (ROC AUC) averaged over repeats and all folds.
Tying this together, the complete example of evaluating standard logistic regression on the imbalanced classification problem is listed below.
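A sketch of how this evaluation might look, under stated assumptions: the dataset arguments are the same assumed settings as in the dataset sketch, and the use of stratified folds (RepeatedStratifiedKFold) and the random seeds are choices made here, not details confirmed by the text.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced dataset (arguments other than n_samples and the ratio are assumed).
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.99], flip_y=0, random_state=2,
)

# Standard, unweighted logistic regression.
model = LogisticRegression(solver='lbfgs')

# Three repeats of 10-fold cross-validation; stratification is an assumption.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# One ROC AUC score per fold per repeat, so 30 scores in total.
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('Mean ROC AUC: %.3f' % mean(scores))
```

Stratified folds keep roughly the same number of minority examples in every fold, which matters when there are only about 100 of them.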
Running the example evaluates the standard logistic regression model on the imbalanced dataset and reports the mean ROC AUC.
We can see that the model has skill, achieving a mean ROC AUC above 0.5.
This provides a baseline for comparison for any modifications performed to the standard logistic regression algorithm.
Logistic regression is an effective model for binary classification tasks, although by default, it is not effective at imbalanced classification.
Logistic regression can be modified to be better suited for imbalanced classification.
The coefficients of the logistic regression algorithm are fit using an optimization algorithm that minimizes the negative log likelihood (loss) for the model on the training dataset.
This involves the repeated use of the model to make predictions followed by an adaptation of the coefficients in a direction that reduces the loss of the model.
The calculation of the loss for a given set of coefficients can be modified to take the class balance into account.
By default, the errors for each class may be considered to have the same weighting, say 1.
These weightings can be adjusted based on the importance of each class.
The weighting is applied to the loss so that smaller weight values result in a smaller error value, and in turn, less update to the model coefficients.
A larger weight value results in a larger error calculation, and in turn, more update to the model coefficients.
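The effect of the weighting on the loss can be illustrated with a small NumPy sketch. The function name weighted_log_loss() and the toy predictions are hypothetical, and this is not scikit-learn's internal implementation (which also includes regularization); it only shows how a per-class weight scales each example's contribution to the negative log likelihood.

```python
import numpy as np


def weighted_log_loss(y_true, p_pred, class_weight=None):
    """Mean negative log likelihood with a per-class weight on each example's error."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0).
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    if class_weight is None:
        class_weight = {0: 1.0, 1: 1.0}  # default: both classes weighted equally
    # Pick the weight that belongs to each example's true class.
    w = np.where(y == 1, class_weight[1], class_weight[0])
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.mean(w * losses))


# Toy case: three majority examples predicted well, one minority example predicted badly.
y_true = [0, 0, 0, 1]
p_pred = [0.1, 0.1, 0.1, 0.1]

unweighted = weighted_log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred, class_weight={0: 1.0, 1: 100.0})
print('unweighted loss: %.3f' % unweighted)
print('weighted loss:   %.3f' % weighted)
```

With the larger weight on class 1, the single minority-class mistake dominates the loss, so an optimizer minimizing it is pushed harder to correct that mistake.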
As such, the modified version of logistic regression is referred to as Weighted Logistic Regression, Class-Weighted Logistic Regression or Cost-Sensitive Logistic Regression.
The weightings are sometimes referred to as importance weightings.
Although straightforward to implement, the challenge of weighted logistic regression is the choice of the weighting to use for each class.
The scikit-learn Python machine learning library provides an implementation of logistic regression that supports class weighting.
The LogisticRegression class provides the class_weight argument that can be specified as a model hyperparameter.
The class_weight is a dictionary that defines each class label (e.g. 0 and 1) and the weighting to apply in the calculation of the negative log likelihood when fitting the model.
For example, a 1 to 1 weighting for each class 0 and 1 can be defined as the dictionary {0: 1.0, 1: 1.0}.
The class weighting can be defined in multiple ways, for example with a general heuristic or with a hyperparameter search such as a grid search.
A best practice for using the class weighting is to use the inverse of the class distribution present in the training dataset.
For example, the class distribution of our synthetic dataset is a 1:100 ratio for the minority class to the majority class.
The inversion of this ratio could be used with 1 for the majority class and 100 for the minority class, that is {0: 1, 1: 100}.
We might also define the same ratio using fractions and achieve the same result, for example {0: 0.01, 1: 1.0}.
We can evaluate the logistic regression algorithm with a class weighting using the same evaluation procedure defined in the previous section.
We would expect the class-weighted version of logistic regression to perform better than the standard version of logistic regression without any class weighting.
The complete example is listed below.
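A minimal sketch of such an example, assuming the same dataset settings and stratified repeated cross-validation as in the earlier sketches. The weighting uses the inverse class ratio written as fractions; the exact dictionary, the stratified folds and the seeds are assumptions rather than details taken from the text.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced dataset (arguments other than n_samples and the ratio are assumed).
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.99], flip_y=0, random_state=2,
)

# Inverse of the 1:100 class ratio: small weight for class 0, large weight for class 1.
weights = {0: 0.01, 1: 1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)

# Same evaluation procedure: three repeats of 10-fold cross-validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('Mean ROC AUC: %.3f' % mean(scores))
```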
Running the example prepares the synthetic imbalanced classification dataset, then evaluates the class-weighted version of logistic regression using repeated cross-validation.
The mean ROC AUC score is reported, in this case 0.989, a better score than that of the unweighted version of logistic regression.
The scikit-learn library provides an implementation of the best practice heuristic for the class weighting.
It is implemented via the compute_class_weight() function and is calculated as n_samples / (n_classes * n_samples_with_class).
We can test this calculation manually on our dataset.
For example, we have 10,000 examples in the dataset, 9,900 in class 0, and 100 in class 1.
The weighting for class 0 is calculated as 10,000 / (2 * 9,900), or about 0.505.
The weighting for class 1 is calculated as 10,000 / (2 * 100), or 50.
We can confirm these calculations by calling the compute_class_weight() function and specifying the class_weight as "balanced".
Doing so gives a weighting of about 0.5 for class 0 and a weighting of 50 for class 1.
These values match our manual calculation.
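The manual calculation and the call to compute_class_weight() described above can be sketched as follows. The dataset arguments other than the sample count and class ratio are assumptions carried over from the dataset sketch.

```python
from numpy import bincount, unique
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced dataset (arguments other than n_samples and the ratio are assumed).
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.99], flip_y=0, random_state=2,
)

classes = unique(y)

# Heuristic as computed by scikit-learn.
weighting = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# The same heuristic by hand: n_samples / (n_classes * n_samples_with_class).
manual = len(y) / (len(classes) * bincount(y))

print(weighting)
print(manual)
```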
The values also match our heuristic calculation above for inverting the ratio of the class distribution in the training dataset: 0.5 to 50 is the same ratio as 1 to 100.
We can use the default class balance directly with the LogisticRegression class by setting the class_weight argument to 'balanced'.
The complete example is listed below.
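A minimal sketch of such an example, assuming the same dataset settings and stratified repeated cross-validation used in the earlier sketches (both are assumptions, not details confirmed by the text). The only change from the manually weighted model is the class_weight argument.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced dataset (arguments other than n_samples and the ratio are assumed).
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.99], flip_y=0, random_state=2,
)

# Let scikit-learn derive the weights from the class distribution of the training data.
model = LogisticRegression(solver='lbfgs', class_weight='balanced')

# Three repeats of 10-fold cross-validation, scored with ROC AUC.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('Mean ROC AUC: %.3f' % mean(scores))
```

Because 'balanced' is recomputed on each training fold, the weights track the class distribution actually seen during fitting rather than a fixed ratio.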
Running the example gives the same mean ROC AUC as we achieved by specifying the inverse class ratio manually.
Using a class weighting that is the inverse ratio of the training data is just a heuristic.
It is possible that better performance can be achieved with a different class weighting, and this too will depend on the choice of performance metric used to evaluate the model.
In this section, we will grid search a range of different class weightings for weighted logistic regression and discover which results in the best ROC AUC score.
We will try a range of weightings for class 0 and 1.
These can be defined as grid search parameters for the GridSearchCV class.
We can perform the grid search on these parameters using repeated cross-validation and estimate model performance using ROC AUC.
Once executed, we can summarize the best configuration as well as all of the results.
Tying this together, the example below grid searches five different class weights for logistic regression on the imbalanced dataset.
We might expect that the heuristic class weighting is the best performing configuration.
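A sketch of such a grid search follows. The text confirms only that five weightings are tried and that a 1:100 majority to minority weighting is among them; the other four dictionaries, the dataset arguments, the stratified folds and the seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic imbalanced dataset (arguments other than n_samples and the ratio are assumed).
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
    weights=[0.99], flip_y=0, random_state=2,
)

model = LogisticRegression(solver='lbfgs')

# Five candidate weightings; only {0: 1, 1: 100} is taken from the text.
balance = [{0: 100, 1: 1}, {0: 10, 1: 1}, {0: 1, 1: 1}, {0: 1, 1: 10}, {0: 1, 1: 100}]
param_grid = dict(class_weight=balance)

# Repeated cross-validation with ROC AUC as the selection metric.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='roc_auc')
grid_result = grid.fit(X, y)

# Report the best configuration, then every configuration with its mean and spread.
print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean_score, std_score, param in zip(means, stds, params):
    print('%f (%f) with: %r' % (mean_score, std_score, param))
```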
Running the example evaluates each class weighting using repeated k-fold cross-validation and reports the best configuration and the associated mean ROC AUC score.
In this case, we can see that the 1:100 majority to minority class weighting achieved the best mean ROC AUC score.
This matches the configuration for the general heuristic.
It might be interesting to explore even more severe class weightings to see their effect on the mean ROC AUC score.
In this tutorial, you discovered cost-sensitive logistic regression for imbalanced classification.