A common mistake made by beginners is to apply machine learning algorithms to a problem without establishing a performance baseline.

A performance baseline provides a minimum score above which a model is considered to have skill on the dataset.

It also provides a point of relative improvement for all models evaluated on the dataset.

A baseline can be established using a naive classifier, such as predicting one class label for all examples in the test dataset.

Another common mistake made by beginners is using classification accuracy as a performance metric on problems that have an imbalanced class distribution.

This can result in high accuracy scores even when the majority class is predicted for all cases.

Instead, an alternate performance metric must be chosen among a suite of classification measures.

The challenge is that the baseline in performance is dependent upon the choice of performance metric.

As such, deep knowledge of each performance metric may be required in order to select an appropriate naive classifier to establish a performance baseline.

In this tutorial, you will discover which naive classifier to use for each imbalanced classification performance metric.

Let’s get started.

Figure: What Is the Naive Classifier for Each Imbalanced Classification Metric? Photo by the Bureau of Land Management, some rights reserved.

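This tutorial is divided into four parts: metrics for imbalanced classification, naive classification models, the naive classifier for each imbalanced classification metric, and a summary of the mapping.

There are many metrics to choose from for imbalanced classification.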

Choosing a metric might be the most important step of the project, as choosing the wrong metric can result in optimizing and choosing a model that solves a problem that is different from the problem that you actually want to solve.

As such, of the tens or hundreds of metrics available, perhaps five are commonly used and work well for imbalanced classification.

They fall into two groups: metrics for evaluating predicted class labels and metrics for evaluating predicted probabilities.

A naive classifier is a classification algorithm with no logic that provides a baseline of performance on a classification dataset.

It is important to establish a baseline in performance for a classification dataset.

It provides a line in the sand by which all other algorithms can be compared.

An algorithm that achieves a score below a naive classification model has no skill on the dataset, whereas an algorithm that achieves a score above that of a naive classification model has some skill on the dataset.

There are perhaps five different naive classification methods that can be used to establish a baseline of performance on a dataset.

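Explained in the context of an imbalanced two-class (binary) classification problem, the naive classification methods are: predict a uniformly random class label, predict a random class label in proportion to the class base rates (stratified), predict the majority class, predict the minority class, and predict the class prior probabilities.

These can be implemented using the DummyClassifier class from the scikit-learn library.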

This class provides the strategy argument that allows different naive classifier techniques to be used.

Examples include the "uniform", "stratified", "most_frequent", "constant", and "prior" strategies.

We have established that there are many different metrics to choose from for an imbalanced classification problem.

We have also established that it is critical to determine a baseline in performance for a new classification problem using a naive classifier.

The challenge is, each classification metric requires the careful choice of a specific naive classification strategy that achieves the appropriate “no skill” performance.

This can and should be selected using knowledge of each metric and can be confirmed by careful experimentation.

In this section, we will rationalize the selection of the appropriate naive classifier for each imbalanced classification metric, then confirm the selection with an empirical result on a synthetic binary classification dataset.

The synthetic dataset has 10,000 examples, 99 percent of which belong to the majority class (negative case or class label 0) and 1 percent of which belong to the minority class (positive case or class label 1).

Each naive classifier strategy is evaluated using stratified 10-fold cross-validation with three repeats, and performance is summarized using the mean and standard deviation across these runs.

The mapping from metrics to naive classifier can be used on your next imbalanced classification project, and the empirical results confirm the rationale and help to establish the intuition for each mapping.

Let’s dive in.

Classification accuracy is the total number of correct predictions divided by the total number of predictions made.

The appropriate naive classifier for classification accuracy is to predict the majority class in all cases.

This will maximize the true negatives and minimize the false positives.

We can demonstrate this with a worked example comparing each naive classifier strategy on a binary classification problem.

We would expect that predicting the majority class would result in a classification accuracy of approximately 99 percent on this dataset.
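The expected accuracy of each naive strategy can be worked out directly from the class priors. The sketch below assumes the 99/1 split described above and that predictions are made independently of the true labels; the function name `expected_accuracy` is illustrative only.

```python
# Expected accuracy of naive strategies on a 99:1 dataset (closed form).
P_NEG, P_POS = 0.99, 0.01  # class priors of the synthetic dataset


def expected_accuracy(p_predict_pos: float) -> float:
    """Accuracy when the positive label is predicted with a fixed
    probability, independently of the true label."""
    return P_POS * p_predict_pos + P_NEG * (1.0 - p_predict_pos)


# Probability of predicting the positive class under each strategy.
strategies = {
    "uniform": 0.5,       # guess each class with equal probability
    "stratified": P_POS,  # guess in proportion to the base rates
    "majority": 0.0,      # always predict class 0
    "minority": 1.0,      # always predict class 1
}
expected = {name: expected_accuracy(p) for name, p in strategies.items()}
for name, score in expected.items():
    print(f"{name:>10}: {score:.4f}")
```

Under these assumptions, predicting the majority class gives the highest value (0.99), with the stratified guess just behind at about 0.98.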

The complete example is listed below.
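A minimal sketch of such a comparison follows. It is not the original listing: the `make_classification` arguments, the random seeds, and the helper name `evaluate` are assumptions chosen to match the dataset and test harness described earlier.

```python
# Compare naive classifier strategies by classification accuracy (a sketch).
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic 99:1 binary dataset; the generator arguments are assumptions.
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)

# Naive strategies keyed by a readable name.
models = {
    "uniform": DummyClassifier(strategy="uniform"),
    "stratified": DummyClassifier(strategy="stratified"),
    "majority": DummyClassifier(strategy="most_frequent"),
    "minority": DummyClassifier(strategy="constant", constant=1),
    "prior": DummyClassifier(strategy="prior"),
}


def evaluate(model, scoring="accuracy"):
    """Stratified 10-fold cross-validation with three repeats."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring=scoring, cv=cv)


results = {name: evaluate(model) for name, model in models.items()}
for name, scores in results.items():
    print(f"{name:>10}: {mean(scores):.3f} ({std(scores):.3f})")
```

The same harness can be pointed at other metrics by changing `scoring`, for example to "f1", "roc_auc", "average_precision", or "neg_brier_score". The G-Mean has no built-in scikit-learn scorer, so it would need a custom one, such as imbalanced-learn's `geometric_mean_score` wrapped with `make_scorer`.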

Running the example reports the classification accuracy for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods.

Try running the example a few times.

In this case, we can see that the majority strategy achieves the best classification accuracy of 99 percent, as we expected.

We can also see that the prior strategy achieves the same result, as it predicts a probability of 0.01 (1 percent) for the positive class in all cases, which is mapped to the majority class label 0.

Box and whisker plots for each naive classifier are also created, allowing the distribution of scores to be compared visually.

Box and Whisker Plot for Naive Classifier Strategies Evaluated Using Classification Accuracy

The geometric mean, or G-Mean, is the geometric mean of the sensitivity and specificity scores.

Sensitivity summarizes how well the positive class was predicted, and specificity summarizes how well the negative class was predicted.

Performing perfectly well on the majority or minority class will come at the cost of a worst-case performance on the other class, which will result in a zero G-Mean score.

Therefore, the most appropriate naive classification strategy is to predict each class with an equal probability, which will give each class an opportunity for a correct prediction.
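The reasoning above can be checked with a few lines of arithmetic. This sketch assumes the sensitivity and specificity that each strategy achieves on average; the helper name `g_mean` is illustrative.

```python
from math import sqrt


def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of sensitivity and specificity."""
    return sqrt(sensitivity * specificity)


# (sensitivity, specificity) each strategy is expected to achieve.
cases = {
    "majority": (0.0, 1.0),  # never predicts the positive class
    "minority": (1.0, 0.0),  # never predicts the negative class
    "uniform": (0.5, 0.5),   # half of each class guessed correctly on average
}
scores = {name: g_mean(sens, spec) for name, (sens, spec) in cases.items()}
for name, score in scores.items():
    print(f"{name:>8}: {score:.2f}")
```

Because the two rates are multiplied, a zero on either class forces the whole score to zero, which is why a constant prediction cannot serve as a useful baseline for this metric.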

We can demonstrate this with a worked example comparing each naive classifier strategy on a binary classification problem.

We would expect that predicting a uniformly random class label would result in a G-Mean of approximately 0.5 on this dataset.

The complete example is listed below.

Running the example reports the G-mean for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods.

Try running the example a few times.

In this case, we can see that, as expected, the uniformly random naive classifier resulted in a G-Mean of 0.5 and all other strategies resulted in a G-Mean score of 0.

Box and whisker plots for each naive classifier are also created, allowing the distribution of scores to be compared visually.

Box and Whisker Plot for Naive Classifier Strategies Evaluated Using G-Mean

The F-measure (also called the F1-score) is calculated as the harmonic mean of the precision and the recall.

Precision summarizes the fraction of examples assigned the positive class that belong to the positive class and recall summarizes how well the positive class was predicted out of all positive predictions that could have been made.

Making predictions that favor recall (e.g. predicting the minority class in all cases) results in perfect recall, while precision falls to the base rate of the minority class. Predicting only the majority class, by contrast, yields no true positives and an F-measure of zero.

Therefore, the naive strategy for the F-measure is to predict the minority class in all cases.

We can demonstrate this with a worked example comparing each naive classifier strategy on a binary classification problem.

The F-measure when predicting only the minority class for this dataset is not obvious at first.

Recall will be perfect, or 1.0, because no positive example is missed.

Precision will be equivalent to the prior for the minority class, that is 1 percent or 0.01, because only 1 percent of the positive predictions are correct.

Therefore, the F-measure is the harmonic mean between 1.0 and 0.01, which is about 0.02.
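The arithmetic can be verified from the confusion counts. This sketch assumes 10,000 examples with 100 positives, all labelled positive by the naive classifier; the helper `f_beta` is illustrative and follows the standard F-beta formula.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Confusion counts when every one of the 10,000 examples is labelled positive.
tp, fp, fn = 100, 9900, 0
precision = tp / (tp + fp)  # 0.01: only 1 percent of positive predictions are right
recall = tp / (tp + fn)     # 1.0: no positive example is missed
f1 = f_beta(precision, recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.4f}")
```

Passing `beta=0.5` or `beta=2` to the same helper gives about 0.012 and 0.048 respectively for this prediction.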

The complete example is listed below.

Running the example reports the F-measure for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods.

Try running the example a few times.

You may get a warning when evaluating the naive classifier that only predicts the majority class, as no positive cases are predicted and precision is therefore undefined.

In this case, we can see that predicting the minority class results in the expected F-measure of about 0.02.

We can also see that we approximate this score when using the uniform and stratified strategies.

Box and whisker plots for each naive classifier are also created, allowing the distribution of scores to be compared visually.

Box and Whisker Plot for Naive Classifier Strategies Evaluated Using F-Measure

This same naive classifier strategy of predicting the minority class is also appropriate when using the F0.5 and F2 measures.

The ROC Curve is a plot of the false positive rate versus the true positive rate for a range of different probability thresholds.

The ROC area under curve is an approximation of the integral or area under the ROC curve and summarizes how well an algorithm performs across the range of probability thresholds.

A no-skill model has a ROC AUC of 0.5, which can be achieved by predicting class labels randomly but in proportion to their base rate (e.g. no discrimination power).

This would be the stratified method.

Predicting a constant value, like the majority class or minority class, will result in an invalid ROC Curve (e.g. a point) and in turn an invalid ROC AUC score.

Scores for models that predict a constant value should be ignored.
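To see where the 0.5 comes from, ROC AUC can be computed as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, with ties counted as one half. The sketch below uses only the standard library; the seed and the helper name `roc_auc` are assumptions for illustration.

```python
import random


def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(1)  # arbitrary seed
y_true = [1] * 100 + [0] * 9900  # 1 percent positive, as in the dataset

# Stratified guess: label 1 with probability equal to the base rate.
stratified = [1 if rng.random() < 0.01 else 0 for _ in y_true]
auc_stratified = roc_auc(y_true, stratified)

# Constant prediction: every positive-negative pair is a tie.
auc_constant = roc_auc(y_true, [0] * len(y_true))
print(f"stratified: {auc_stratified:.3f}, constant: {auc_constant:.3f}")
```

A constant prediction scores exactly 0.5 here only because of the tie convention; no ranking is involved, which is one way to read the advice to ignore such scores.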

The complete example is listed below.

Running the example reports the ROC AUC for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods.

Try running the example a few times.

In this case, we can see that as expected, predicting a stratified random label results in the no-skill ROC AUC of 0.5.

Box and Whisker Plot for Naive Classifier Strategies Evaluated Using ROC AUC

The Precision-Recall Curve (or PR Curve) is a plot of the recall versus the precision for a range of different probability thresholds.

The Precision-Recall area under curve is an approximation of the integral or area under the Precision-Recall curve and summarizes how well an algorithm performs across the range of probability thresholds.

A no-skill model has a PR AUC that matches the base rate of the positive class, e.g. 0.01.

This can be achieved by predicting class labels randomly but in proportion to their base rate (e.g. no discrimination power).

This would be the stratified method.

Predicting a constant value, like the majority class or minority class, will result in an invalid PR Curve (e.g. a point) and in turn an invalid PR AUC score.

Scores for models that predict a constant value should be ignored.
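The link between the no-skill PR AUC and the base rate can be illustrated with scores that carry no information about the labels. Note that this sketch swaps in uniformly random scores rather than stratified labels, and summarizes the curve as average precision; the seed and the helper name `average_precision` are assumptions.

```python
import random


def average_precision(y_true, y_score):
    """Area under the precision-recall curve, summarized as the mean of the
    precision values at the rank of each positive example."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits


rng = random.Random(1)  # arbitrary seed
y_true = [1] * 100 + [0] * 9900  # 1 percent positive, as in the dataset
base_rate = sum(y_true) / len(y_true)

# Scores unrelated to the labels: a model with no discrimination power.
y_score = [rng.random() for _ in y_true]
ap = average_precision(y_true, y_score)
print(f"base rate: {base_rate:.3f}, no-skill PR AUC: {ap:.3f}")
```

With only 100 positive examples the estimate is noisy, so the printed value should sit near the base rate rather than match it exactly.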

The complete example is listed below.

Running the example reports the PR AUC score for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods.

Try running the example a few times.

In this case, we can see that as expected, predicting a stratified random class label results in the no-skill PR AUC of close to 0.01.

Box and Whisker Plot for Naive Classifier Strategies Evaluated Using Precision-Recall AUC

Brier score calculates the mean squared error between the expected probabilities and the predicted probabilities.

The appropriate naive classifier for Brier score is to predict the class priors for each example in the test set.

For a binary classification problem that involves predicting a Binomial distribution, this would be the prior for class 0 and the prior for class 1.

The model would predict the probabilities [0.99, 0.01] in all cases.

We would expect that this will result in a mean squared error close to the prior for the minority class, e.g. 0.01 on this dataset.

This is because the Binomial probability for most examples is 0.0, with only 1 percent having 1.0, which results in a near-maximum error for 1 percent of cases, or a Brier score of about 0.01.
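The expected scores can be computed directly. This sketch assumes 10,000 examples with 100 positives and a constant predicted probability for the positive class under each strategy; the helper name `brier_score` is illustrative.

```python
def brier_score(y_true, p_pos):
    """Mean squared error between true labels and predicted P(class=1)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p_pos)) / len(y_true)


y_true = [1] * 100 + [0] * 9900  # 1 percent positive, as in the dataset

# Constant predicted probability of the positive class for each strategy.
strategies = {"prior": 0.01, "majority": 0.0, "minority": 1.0, "uniform": 0.5}
scores = {name: brier_score(y_true, [p] * len(y_true))
          for name, p in strategies.items()}
for name, score in scores.items():
    print(f"{name:>8}: {score:.4f}")
```

Predicting the prior gives 0.0099 and predicting the majority class gives 0.0100, so the two are indistinguishable when rounded to two or three decimal places.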

The complete example is listed below.

Running the example reports the Brier score for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods.

Try running the example a few times.

Brier score is minimized, with 0.0 representing the lowest possible score.

Because scikit-learn treats larger scores as better, it inverts the score by making it negative, hence the negative mean Brier scores for each naive classifier.

The sign can, therefore, be ignored.

As expected, we can see that predicting the prior probability results in the best score.

We can also see that predicting the majority class results in nearly the same best Brier score.

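Box and Whisker Plot for Naive Classifier Strategies Evaluated Using Brier Score

We can summarize the mapping of imbalanced classification metrics to naive classification methods.

This provides a look-up table that you can consult on your next imbalanced classification project.

Accuracy: predict the majority class (class 0).

G-Mean: predict a uniformly random class.

F-Measure (including F0.5 and F2): predict the minority class (class 1).

ROC AUC: predict a stratified random class.

Precision-Recall AUC: predict a stratified random class.

Brier score: predict the class priors.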

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered which naive classifier to use for each imbalanced classification performance metric.
