Many imbalanced classification tasks require a skillful model that predicts a crisp class label, where both classes are equally important.

An example of an imbalanced classification problem where a class label is required and both classes are equally important is the detection of oil spills or slicks in satellite images.

The detection of a spill requires mobilizing an expensive response, and missing an event is equally expensive, causing damage to the environment.

One way to evaluate imbalanced classification models that predict crisp labels is to calculate the separate accuracy on the positive class and the negative class, referred to as sensitivity and specificity.

These two measures can then be averaged using the geometric mean, referred to as the G-mean, that is insensitive to the skewed class distribution and correctly reports on the skill of the model on both classes.

In this tutorial, you will discover how to develop a model to predict the presence of an oil spill in satellite images and evaluate it using the G-mean metric.

After completing this tutorial, you will know:Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

Develop an Imbalanced Classification Model to Detect Oil SpillsPhoto by Lenny K Photography, some rights reserved.

This tutorial is divided into five parts; they are:In this project, we will use a standard imbalanced machine learning dataset referred to as the “oil spill” dataset, “oil slicks” dataset or simply “oil.

”The dataset was introduced in the 1998 paper by Miroslav Kubat, et al.

titled “Machine Learning for the Detection of Oil Spills in Satellite Radar Images.

” The dataset is often credited to Robert Holte, a co-author of the paper.

The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not.

Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.

The input to [the system] is a raw pixel image from a radar satellite Image processing techniques are used […] The output of the image processing is a fixed-length feature vector for each suspicious region.

During normal operation these feature vectors are fed into a classier to decide which images and which regions within an image to present for human inspection.

— Machine Learning for the Detection of Oil Spills in Satellite Radar Images, 1998.

The task is given a vector that describes the contents of a patch of a satellite image, then predicts whether the patch contains an oil spill or not, e.

g.

from the illegal or accidental dumping of oil in the ocean.

There are 937 cases.

Each case is comprised of 48 numerical computer vision derived features, a patch number, and a class label.

A total of nine satellite images were processed into patches.

Cases in the dataset are ordered by image and the first column of the dataset represents the patch number for the image.

This was provided for the purposes of estimating model performance per-image.

In this case, we are not interested in the image or patch number and this first column can be removed.

The normal case is no oil spill assigned the class label of 0, whereas an oil spill is indicated by a class label of 1.

There are 896 cases for no oil spill and 41 cases of an oil spill.

The second critical feature of the oil spill domain can be called an imbalanced training set: there are very many more negative examples lookalikes than positive examples oil slicks.

Against the 41 positive examples we have 896 negative examples the majority class thus comprises almost 96% of the data.

— Machine Learning for the Detection of Oil Spills in Satellite Radar Images, 1998.

We do not have access to the program used to prepare computer vision features from the satellite images, therefore we are restricted to work with the extracted features that were collected and made available.

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-CourseFirst, download the dataset and save it in your current working directory with the name “oil-spill.

csv”Review the contents of the file.

The first few lines of the file should look as follows:We can see that the first column contains integers for the patch number.

We can also see that the computer vision derived features are real-valued with differing scales such as thousands in the second column and fractions in other columns.

All input variables are numeric, and there are no missing values marked with a “?” character.

Firstly, we can load the CSV dataset and confirm the number of rows and columns.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

We can also summarize the number of examples in each class using the Counter object.

Tying this together, the complete example of loading and summarizing the dataset is listed below.

Running the example first loads the dataset and confirms the number of rows and columns.

The class distribution is then summarized, confirming the number of oil spills and non-spills and the percentage of cases in the minority and majority classes.

We can also take a look at the distribution of each variable by creating a histogram for each.

With 50 variables, it is a lot of plots, but we might spot some interesting patterns.

Also, with so many plots, we must turn off the axis labels and plot titles to reduce the clutter.

The complete example is listed below.

Running the example creates the figure with one histogram subplot for each of the 50 variables in the dataset.

We can see many different distributions, some with Gaussian-like distributions, others with seemingly exponential or discrete distributions.

Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.

Histogram of Each Variable in the Oil Spill DatasetNow that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split.

We will use k=10, meaning each fold will contain about 937/10 or about 94 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 96% to 4% non-spill and spill.

Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model.

We will use three repeats.

This means a single model will be fit and evaluated 10 * 3, or 30, times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.

We are predicting class labels of whether a satellite image patch contains a spill or not.

There are many measures we could use, although the authors of the paper chose to report the sensitivity, specificity, and the geometric mean of the two scores, called the G-mean.

To this end, we have mainly used the geometric mean (g-mean) […] This measure has the distinctive property of being independent of the distribution of examples between classes, and is thus robust in circumstances where this distribution might change with time or be different in the training and testing sets.

— Machine Learning for the Detection of Oil Spills in Satellite Radar Images, 1998.

Recall that the sensitivity is a measure of the accuracy for the positive class and specificity is a measure of the accuracy of the negative class.

The G-mean seeks a balance of these scores, the geometric mean, where poor performance for one or the other results in a low G-mean score.

We can calculate the G-mean for a set of predictions made by a model using the geometric_mean_score() function provided by the imbalanced-learn library.

First, we can define a function to load the dataset and split the columns into input and output variables.

We will also drop column 22 because the column contains a single value, and the first column that defines the image patch number.

The load_dataset() function below implements this.

We can then define a function that will evaluate a given model on the dataset and return a list of G-Mean scores for each fold and repeat.

The evaluate_model() function below implements this, taking the dataset and model as arguments and returning the list of scores.

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the majority class label (0) or the minority class label (1) for all cases will result in a G-mean of zero.

As such, a good default strategy would be to randomly predict one class label or another with a 50% probability and aim for a G-mean of about 0.

5.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “strategy” argument to ‘uniform‘.

Once the model is evaluated, we can report the mean and standard deviation of the G-mean scores directly.

Tying this together, the complete example of loading the dataset, evaluating a baseline model and reporting the performance is listed below.

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded, and that we have 47 computer vision derived input variables, with the constant value column (index 22) and the patch number column (index 0) removed.

Importantly, we can see that the class labels have the correct mapping to integers with 0 for the majority class and 1 for the minority class, customary for imbalanced binary classification dataset.

Next, the average of the G-Mean scores is reported.

Your specific results will vary given the stochastic nature of the algorithm; consider running the example a few times.

In this case, we can see that the baseline algorithm achieves a G-Mean of about 0.

47, close to the theoretical maximum of 0.

5.

This score provides a lower limit on model skill; any model that achieves an average G-Mean above about 0.

47 (or really above 0.

5) has skill, whereas models that achieve a score below this value do not have skill on this dataset.

It is interesting to note that a good G-mean reported in the paper was about 0.

811, although the model evaluation procedure was different.

This provides a rough target for “good” performance on this dataset.

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good, but not highly optimized (e.

g.

hyperparameters are not tuned).

What score can you get? If you can achieve better G-mean performance using the same test harness, I’d love to hear about it.

Let me know in the comments below.

Let’s start by evaluating some probabilistic models on the dataset.

Probabilistic models are those models that are fit on the data under a probabilistic framework and often perform well in general for imbalanced classification datasets.

We will evaluate the following probabilistic models with default hyperparameters in the dataset:Both LR and LDA are sensitive to the scale of the input variables, and often expect and/or perform better if input variables with different scales are normalized or standardized as a pre-processing step.

In this case, we will standardize the dataset prior to fitting each model.

This will be achieved using a Pipeline and the StandardScaler class.

The use of a Pipeline ensures that the StandardScaler is fit on the training dataset and applied to the train and test sets within each k-fold cross-validation evaluation, avoiding any data leakage that might result in an optimistic result.

We can define a list of models to evaluate on our test harness as follows:Once defined, we can enumerate the list and evaluate each in turn.

The mean and standard deviation of G-mean scores can be printed during evaluation and the sample of scores can be stored.

Algorithms can be compared directly based on their mean G-mean score.

At the end of the run, we can use the scores to create a box and whisker plot for each algorithm.

Creating the plots side by side allows the distributions to be compared both with regard to the mean score, but also the middle 50 percent of the distribution between the 25th and 75th percentiles.

Tying this together, the complete example comparing three probabilistic models on the oil spill dataset using the test harness is listed below.

Running the example evaluates each of the probabilistic models on the dataset.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

You may see some warnings from the LDA algorithm such as “Variables are collinear“.

These can be safely ignored for now, but suggests that the algorithm could benefit from feature selection to remove some of the variables.

In this case, we can see that each algorithm has skill, achieving a mean G-mean above 0.

5.

The results suggest that an LDA might be the best performing of the models tested.

The distribution of the G-mean scores is summarized using a figure with a box and whisker plot for each algorithm.

We can see that the distribution for both LDA and NB is compact and skillful and that the LR may have a few results during the run where the method performed poorly, pushing the distribution down.

This highlights that it is not just the mean performance, but also the consistency of the model that should be considered when selecting a model.

Box and Whisker Plot of Probabilistic Models on the Imbalanced Oil Spill DatasetWe’re off to a good start, but we can do better.

The logistic regression algorithm supports a modification that adjusts the importance of classification errors to be inversely proportional to the class weighting.

This allows the model to better learn the class boundary in favor of the minority class, which might help overall G-mean performance.

We can achieve this by setting the “class_weight” argument of the LogisticRegression to ‘balanced‘.

As mentioned, logistic regression is sensitive to the scale of input variables and can perform better with normalized or standardized inputs; as such it is a good idea to test both for a given dataset.

Additionally, a power distribution can be used to spread out the distribution of each input variable and make those variables with a Gaussian-like distribution more Gaussian.

This can benefit models like Logistic Regression that make assumptions about the distribution of input variables.

The power transom will use the Yeo-Johnson method that supports positive and negative inputs, but we will also normalize data prior to the transform.

Also, the PowerTransformer class used for the transform will also standardize each variable after the transform.

We will compare a LogisticRegression with a balanced class weighting to the same algorithm with three different data preparation schemes, specifically normalization, standardization, and a power transform.

Tying this together, the comparison of balanced logistic regression with different data preparation schemes is listed below.

Running the example evaluates each version of the balanced logistic regression model on the dataset.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

You may see some warnings from the first balanced LR model, such as “Liblinear failed to converge“.

These warnings can be safely ignored for now but suggest that the algorithm could benefit from feature selection to remove some of the variables.

In this case, we can see that the balanced version of logistic regression performs much better than all of the probabilistic models evaluated in the previous section.

The results suggest that perhaps the use of balanced LR with data normalization for pre-processing performs the best on this dataset with a mean G-mean score of about 0.

852.

This is in the range or better than the results reported in the 1998 paper.

A figure is created with box and whisker plots for each algorithm, allowing the distribution of results to be compared.

We can see that the distribution for the balanced LR is tighter in general than the non-balanced version in the previous section.

We can also see that the median result (orange line) for the normalized version is higher than the mean, above 0.

9, which is impressive.

A mean different from the median suggests a skewed distribution for the results, pulling the mean down with a few bad outcomes.

Box and Whisker Plot of Balanced Logistic Regression Models on the Imbalanced Oil Spill DatasetWe now have excellent results with little work; let’s see if we can take it one step further.

Data sampling provides a way to better prepare the imbalanced training dataset prior to fitting a model.

Perhaps the most popular data sampling is the SMOTE oversampling technique for creating new synthetic examples for the minority class.

This can be paired with the edited nearest neighbor (ENN) algorithm that will locate and remove examples from the dataset that are ambiguous, making it easier for models to learn to discriminate between the two classes.

This combination is called SMOTE-ENN and can be implemented using the SMOTEENN class from the imbalanced-learn library; for example:SMOTE and ENN both work better when the input data is scaled beforehand.

This is because both techniques involve using the nearest neighbor algorithm internally and this algorithm is sensitive to input variables with different scales.

Therefore, we will require the data to be normalized as a first step, then sampled, then used as input to the (unbalanced) logistic regression model.

As such, we can use the Pipeline class provided by the imbalanced-learn library to create a sequence of data transforms including the data sampling method, and ending with the logistic regression model.

We will compare four variations of the logistic regression model with data sampling, specifically:The expectation is that LR will perform better with SMOTEENN, and that SMOTEENN will perform better with standardization or normalization.

The last case does a lot, first normalizing the dataset, then applying the power transform, standardizing the result (recall that the PowerTransformer class will standardize the output by default), applying SMOTEENN, then finally fitting a logistic regression model.

These combinations can be defined as follows:Tying this together, the complete example is listed below.

Running the example evaluates each version of the SMOTEENN with logistic regression model on the dataset.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that the addition of SMOTEENN improves the performance of the default LR algorithm, achieving a mean G-mean of 0.

852 compared to 0.

621 seen in the first set of experimental results.

This is even better than balanced LR without any data scaling (previous section) that achieved a G-mean of about 0.

846.

The results suggest that perhaps the final combination of normalization, power transform, and standardization achieves a slightly better score than the default LR with SMOTEENN with a G-mean of about 0.

873, although the warning messages suggest some problems that need to be ironed out.

The distribution of results can be compared with box and whisker plots.

We can see the distributions all roughly have the same tight spread and that the difference in means of the results can be used to select a model.

Box and Whisker Plot of Logistic Regression Models with Data Sampling on the Imbalanced Oil Spill DatasetThe use of SMOTEENN with Logistic Regression directly without any data scaling probably provides the simplest and well-performing model that could be used going forward.

This model had a mean G-mean of about 0.

852 on our test harness.

We will use this as our final model and use it to make predictions on new data.

First, we can define the model as a pipeline.

Once defined, we can fit it on the entire training dataset.

Once fit, we can use it to make predictions for new data by calling the predict() function.

This will return the class label of 0 for no oil spill, or 1 for an oil spill.

For example:To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know there is no oil spill, and a few where we know there is.

The complete example is listed below.

Running the example first fits the model on the entire training dataset.

Then the fit model used to predict the label of an oil spill for cases where we know there is none, chosen from the dataset file.

We can see that all cases are correctly predicted.

Then some cases of actual oil spills are used as input to the model and the label is predicted.

As we might have hoped, the correct labels are again predicted.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop a model to predict the presence of an oil spill in satellite images and evaluate it using the G-mean metric.

Specifically, you learned:Do you have any questions? Ask your questions in the comments below and I will do my best to answer.

with just a few lines of python codeDiscover how in my new Ebook: Imbalanced Classification with PythonIt provides self-study tutorials and end-to-end projects on: Performance Metrics, Undersampling Methods, SMOTE, Threshold Moving, Probability Calibration, Cost-Sensitive Algorithms and much more.

.