Misclassification errors on the minority class are more important than other types of prediction errors for some imbalanced classification tasks.

One example is the problem of classifying bank customers as to whether they should receive a loan or not.

Giving a loan to a bad customer marked as a good customer results in a greater cost to the bank than denying a loan to a good customer marked as a bad customer.

This requires careful selection of a performance metric that both promotes minimizing misclassification errors in general, and favors minimizing one type of misclassification error over another.

The German credit dataset is a standard imbalanced classification dataset that has this property of differing costs to misclassification errors.

Models fit on this dataset can be evaluated using the Fbeta-Measure, which both quantifies model performance generally and captures the requirement that one type of misclassification error is more costly than another.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced German credit classification dataset.


Let’s get started.

Develop an Imbalanced Classification Model to Predict Good and Bad Credit

Photo by AL Nieves, some rights reserved.

This tutorial is divided into five parts.

In this project, we will use a standard imbalanced machine learning dataset referred to as the “German Credit” dataset or simply “German.”

The dataset was used as part of the Statlog project, a European-based initiative in the 1990s to evaluate and compare a large number (at the time) of machine learning algorithms on a range of different classification tasks.

The dataset is credited to Hans Hofmann.

The fragmentation amongst different disciplines has almost certainly hindered communication and progress.

The StatLog project was designed to break down these divisions by selecting classification procedures regardless of historical pedigree, testing them on large-scale and commercially important problems, and hence to determine to what extent the various techniques met the needs of industry.

— Page 4, Machine Learning, Neural and Statistical Classification, 1994.

The German credit dataset describes financial and banking details for customers, and the task is to determine whether the customer is good or bad.

The assumption is that the task involves predicting whether a customer will pay back a loan or credit.

The dataset includes 1,000 examples and 20 input variables, 7 of which are numerical (integer) and 13 are categorical.

Some of the categorical variables have an ordinal relationship, such as “Savings account,” although most do not.

There are two classes, 1 for good customers and 2 for bad customers.

Good customers are the default or negative class, whereas bad customers are the exception or positive class.

A total of 70 percent of the examples are good customers, whereas the remaining 30 percent of examples are bad customers.

A cost matrix is provided with the dataset that gives a different penalty to each misclassification error for the positive class.

Specifically, a cost of five is applied to a false negative (marking a bad customer as good) and a cost of one is assigned for a false positive (marking a good customer as bad).

This suggests that the positive class is the focus of the prediction task and that it is more costly to the bank or financial institution to give money to a bad customer than to not give money to a good customer.

This must be taken into account when selecting a performance metric.
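To make the cost matrix concrete, the short sketch below totals the cost of a set of predictions using the 5:1 penalties described above. The function name `total_cost` and the toy labels are illustrative assumptions, not something that ships with the dataset.

```python
# Illustrative sketch: total misclassification cost under the 5:1 cost matrix.
# Labels follow the convention used later in the tutorial: 0 = good, 1 = bad.
FN_COST = 5  # bad customer marked as good
FP_COST = 1  # good customer marked as bad

def total_cost(y_true, y_pred):
    """Sum the cost of every false negative and false positive."""
    cost = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 0:
            cost += FN_COST
        elif actual == 0 and predicted == 1:
            cost += FP_COST
    return cost

# toy labels, made up for illustration
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1]
print(total_cost(y_true, y_pred))  # one false positive (1) + one false negative (5) = 6
```

A single missed bad customer costs as much as five wrongly rejected good customers, which is why recall on the positive class matters so much here.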

Next, let’s take a closer look at the data.


First, download the dataset and save it in your current working directory with the name “german.csv”.

Review the contents of the file.

Looking at the first few lines of the file, we can see that the categorical columns are encoded with an Axxx format, where “x” are integers for different labels.

A one-hot encoding of the categorical variables will be required.

We can also see that the numerical variables have different scales, e.g. 6, 48, and 12 in column 2, and 1169, 5951, etc. in column 5.

This suggests that scaling of the integer columns will be needed for those algorithms that are sensitive to scale.

The target variable or class is the last column and contains values of 1 and 2.

These will need to be label encoded to 0 and 1, respectively, to meet the general expectation for imbalanced binary classification tasks where 0 represents the negative case and 1 represents the positive case.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

We can also summarize the number of examples in each class using the Counter object.

Tying this together, the complete example of loading and summarizing the dataset is listed below.
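A minimal sketch of this step is shown here, assuming the file has no header row and the class label is the last column. The function name `summarize` is an illustrative choice, and the usage line assumes the file was saved as german.csv.

```python
# Sketch: load the dataset and summarize its shape and class distribution.
from collections import Counter
from pandas import read_csv

def summarize(source):
    """Load a headerless CSV and print its shape and class distribution."""
    dataframe = read_csv(source, header=None)
    # number of rows and columns
    print(dataframe.shape)
    # the class label is assumed to be the last column
    target = dataframe.iloc[:, -1]
    counter = Counter(target)
    for label, count in counter.items():
        percent = count / len(target) * 100
        print('Class=%s, Count=%d, Percentage=%.3f%%' % (label, count, percent))
    return dataframe, counter

# usage, assuming the file was saved as german.csv in the working directory:
# dataframe, counter = summarize('german.csv')
```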

Running the example first loads the dataset and confirms the number of rows and columns, that is 1,000 rows and 20 input variables and 1 target variable.

The class distribution is then summarized, confirming the number of good and bad customers and the percentage of cases in the minority and majority classes.

We can also take a look at the distribution of the seven numerical input variables by creating a histogram for each.

First, we can select the columns with numeric variables by calling the select_dtypes() function on the DataFrame.

We can then select just those columns from the DataFrame.

We would expect there to be seven, plus the numerical class labels.

We can then create histograms of each numeric input variable.

The complete example is listed below.
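One way to sketch this is shown here. The helper names are illustrative, and selecting columns with `include='number'` is an assumption about how to pick out the integer columns. The plotting call relies on matplotlib being installed.

```python
# Sketch: select the numeric columns and plot one histogram per column.
from pandas import read_csv

def numeric_columns(dataframe):
    """Return only the numeric columns (the inputs plus the numeric class label)."""
    return dataframe.select_dtypes(include='number')

def plot_histograms(dataframe):
    """Draw one histogram subplot per numeric column."""
    # imported here so the column selection has no plotting dependency
    from matplotlib import pyplot
    numeric_columns(dataframe).hist()
    pyplot.show()

# usage, assuming german.csv is in the working directory:
# plot_histograms(read_csv('german.csv', header=None))
```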

Running the example creates the figure with one histogram subplot for each of the seven input variables and one class label in the dataset.

The title of each subplot indicates the column number in the DataFrame (e.g. zero-offset from 0 to 20).

We can see many different distributions, some with Gaussian-like distributions, others with seemingly exponential or discrete distributions.

Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.

Histogram of Numeric Variables in the German Credit Dataset

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split.

We will use k=10, meaning each fold will contain about 1000/10 or 100 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 70 percent to 30 percent good to bad customers.

Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model.

We will use three repeats.

This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
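The sketch below shows the configuration described above on synthetic labels with the same 70/30 class mix. The `random_state` value is an arbitrary choice for reproducibility, and the placeholder inputs are not the real data.

```python
# Sketch: 10-fold stratified cross-validation repeated 3 times.
from numpy import array
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# synthetic labels with the dataset's 70/30 class mix (for illustration only)
y = array([0] * 700 + [1] * 300)
X = y.reshape(-1, 1)  # placeholder inputs; only the labels drive the stratification

# 10 folds * 3 repeats = 30 train/test splits
print(cv.get_n_splits(X, y))

for train_ix, test_ix in cv.split(X, y):
    # size of the first test fold and the number of positive examples in it
    print(len(test_ix), y[test_ix].sum())
    break
```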

We will predict class labels of whether a customer is good or not.

Therefore, we need a measure that is appropriate for evaluating the predicted class labels.

The focus of the task is on the positive class (bad customers).

Precision and recall are a good place to start.

Maximizing precision will minimize the false positives and maximizing recall will minimize the false negatives in the predictions made by a model.

The F-Measure calculates the harmonic mean of precision and recall.

This is a good single number that can be used to compare and select a model on this problem.

The issue is that false negatives are more damaging than false positives.

Remember that false negatives on this dataset are cases of a bad customer being marked as a good customer and being given a loan.

False positives are cases of a good customer being marked as a bad customer and not being given a loan.

False negatives are more costly to the bank than false positives.

Put another way, we are interested in the F-measure that will summarize a model’s ability to minimize misclassification errors for the positive class, but we want to favor models that are better at minimizing false negatives over false positives.

This can be achieved by using a version of the F-measure that calculates a weighted harmonic mean of precision and recall but favors higher recall scores over precision scores.

This is called the Fbeta-measure, a generalization of F-measure, where “beta” is a parameter that defines the weighting of the two scores.

A beta value of 2 places more weight on recall than on precision and is referred to as the F2-measure.

We will use this measure to evaluate models on the German credit dataset.

This can be achieved using the fbeta_score() scikit-learn function.
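As a small worked example, the sketch below compares fbeta_score() with beta=2 against the formula computed by hand. The labels are synthetic and chosen only so the arithmetic is easy to follow.

```python
# Sketch: F2-measure from scikit-learn versus the formula computed by hand.
from sklearn.metrics import fbeta_score, precision_score, recall_score

# synthetic labels: 3 true positives, 1 false negative, 2 false positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.6
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75

# Fbeta = (1 + beta^2) * P * R / (beta^2 * P + R)
beta = 2
manual = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

f2 = fbeta_score(y_true, y_pred, beta=2)
print('F2=%.3f (manual %.3f)' % (f2, manual))  # both about 0.714
```

With beta=2, the score (about 0.714) sits closer to the recall of 0.75 than to the precision of 0.6, which is the behavior we want on this problem.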

We can define a function to load the dataset and split the columns into input and output variables.

We will one-hot encode the categorical variables and label encode the target variable.

You might recall that a one-hot encoding replaces the categorical variable with one new column for each value of the variable and marks values with a 1 in the column for that value.

First, we must split the DataFrame into input and output variables.

Next, we need to select all input variables that are categorical, then apply a one-hot encoding and leave the numerical variables untouched.

This can be achieved using a ColumnTransformer and defining the transform as a OneHotEncoder applied only to the column indices for categorical variables.

We can then label encode the target variable.

The load_dataset() function below ties all of this together and loads and prepares the dataset for modeling.
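A minimal sketch of such a function follows. It assumes the file has no header row, that the class label is the last column, and that every non-numeric input column is categorical.

```python
# Sketch: load the dataset, one-hot encode categorical inputs, label encode the target.
from pandas import read_csv
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def load_dataset(source):
    """Return the encoded inputs X and the label-encoded target y."""
    dataframe = read_csv(source, header=None)
    # split into inputs and output; the class label is the last column
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # treat every non-numeric input column as categorical (an assumption)
    cat_ix = X.select_dtypes(exclude='number').columns
    # one-hot encode categorical columns, pass numeric columns through unchanged
    ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
    X = ct.fit_transform(X)
    # label encode the target so that 1 becomes 0 (good) and 2 becomes 1 (bad)
    y = LabelEncoder().fit_transform(y)
    return X, y

# usage, assuming german.csv is in the working directory:
# X, y = load_dataset('german.csv')
```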

Next, we need a function that will evaluate a set of predictions using the fbeta_score() function with beta set to 2.

We can then define a function that will evaluate a given model on the dataset and return a list of F2-Measure scores for each fold and repeat.

The evaluate_model() function below implements this, taking the dataset and model as arguments and returning the list of scores.
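The two functions might be sketched as follows. The `random_state` value is an arbitrary choice, and the names mirror those used in the text.

```python
# Sketch: F2 scoring function and a repeated stratified k-fold evaluation.
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def f2_measure(y_true, y_pred):
    """Fbeta-measure with beta=2, which weights recall more than precision."""
    return fbeta_score(y_true, y_pred, beta=2)

def evaluate_model(X, y, model):
    """Return one F2 score per fold and repeat (10 folds * 3 repeats = 30 scores)."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    metric = make_scorer(f2_measure)
    return cross_val_score(model, X, y, scoring=metric, cv=cv)

# usage: X, y are the prepared inputs and labels, model is any scikit-learn classifier
# scores = evaluate_model(X, y, model)
# print('Mean F2: %.3f (%.3f)' % (scores.mean(), scores.std()))
```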

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the minority class for all examples will achieve a maximum recall score and a baseline precision score.

This provides a baseline in model performance on this problem by which all other models can be compared.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “strategy” argument to “constant” and the “constant” argument to “1” for the minority class.

Once the model is evaluated, we can report the mean and standard deviation of the F2-Measure scores directly.

Tying this together, the complete example of loading the German Credit dataset, evaluating a baseline model, and reporting the performance is listed below.
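The baseline score can also be checked by hand. The sketch below uses synthetic labels with the dataset's 70/30 class mix rather than the real file, and scores the model on the same labels it was fit on, which is harmless for a constant prediction.

```python
# Sketch: F2-measure of a model that always predicts the minority class.
from numpy import array, zeros
from sklearn.dummy import DummyClassifier
from sklearn.metrics import fbeta_score

# synthetic labels with the dataset's 70/30 class mix (not the real data)
y = array([0] * 700 + [1] * 300)
X = zeros((len(y), 1))  # inputs are ignored by the dummy model

# always predict the minority (positive) class
model = DummyClassifier(strategy='constant', constant=1)
model.fit(X, y)
yhat = model.predict(X)

# precision = 300 / 1000 = 0.3 and recall = 300 / 300 = 1.0, so
# F2 = 5 * 0.3 * 1.0 / (4 * 0.3 + 1.0) = 1.5 / 2.2, about 0.682
score = fbeta_score(y, yhat, beta=2)
print('Baseline F2: %.3f' % score)
```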

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded, and through the one-hot encoding of the categorical input variables, we have increased the number of input variables from 20 to 61.

That suggests that the 13 categorical variables were encoded into a total of 54 columns.

Importantly, we can see that the class labels have the correct mapping to integers, with 0 for the majority class and 1 for the minority class, as is customary for imbalanced binary classification datasets.

Next, the average of the F2-Measure scores is reported.

In this case, we can see that the baseline algorithm achieves an F2-Measure of about 0.682.

This score provides a lower limit on model skill; any model that achieves an average F2-Measure above about 0.682 has skill, whereas models that achieve a score below this value do not have skill on this dataset.

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

Can you do better? If you can achieve better F2-Measure performance using the same test harness, I’d love to hear about it.

Let me know in the comments below.

Let’s start by evaluating a mixture of probabilistic machine learning models on the dataset.

It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.

We will evaluate a suite of machine learning models on the German credit dataset, using mostly default model hyperparameters.

We will define each model in turn and add them to a list so that we can evaluate them sequentially.

The get_models() function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

We will one-hot encode the categorical input variables as we did in the previous section, and in this case, we will normalize the numerical input variables.

This is best performed using the MinMaxScaler within each fold of the cross-validation evaluation process.

An easy way to implement this is to use a Pipeline where the first step is a ColumnTransformer that applies a OneHotEncoder to just the categorical variables, and a MinMaxScaler to just the numerical input variables.

To achieve this, we need a list of the column indices for categorical and numerical input variables.

We can update the load_dataset() to return the column indexes as well as the input and output elements of the dataset.

The updated version of this function is listed below.

We can then call this function to get the data and the list of categorical and numerical variables.

This can be used to prepare a Pipeline to wrap each model prior to evaluating it.

First, the ColumnTransformer is defined, which specifies what transform to apply to each type of column, then this is used as the first step in a Pipeline that ends with the specific model that will be fit and evaluated.
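The wrapping step can be sketched as follows. The helper names are illustrative, LogisticRegression stands in for whichever model is being evaluated, and `handle_unknown='ignore'` is an added assumption so that a category missing from a training fold does not raise an error at prediction time.

```python
# Sketch: wrap a model in a Pipeline that encodes and scales inside each fold.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def split_column_types(X):
    """Return (categorical, numerical) column indexes of the raw input DataFrame."""
    cat_ix = X.select_dtypes(exclude='number').columns
    num_ix = X.select_dtypes(include='number').columns
    return cat_ix, num_ix

def wrap_model(model, cat_ix, num_ix):
    """Return a Pipeline: one-hot encode and normalize, then fit the model."""
    ct = ColumnTransformer([
        ('c', OneHotEncoder(handle_unknown='ignore'), cat_ix),
        ('n', MinMaxScaler(), num_ix),
    ])
    return Pipeline(steps=[('t', ct), ('m', model)])

# usage: X is the DataFrame of raw (unencoded) input columns
# cat_ix, num_ix = split_column_types(X)
# pipeline = wrap_model(LogisticRegression(solver='liblinear'), cat_ix, num_ix)
```

Because the transforms live inside the Pipeline, the scaler and encoder are fit on the training folds only, which avoids leaking information from the test fold.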

We can summarize the mean F2-Measure for each algorithm; this will help to directly compare algorithms.

At the end of the run, we will create a separate box and whisker plot for each algorithm’s sample of results.

These plots will use the same y-axis scale so we can compare the distribution of results directly.

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the German credit dataset is listed below.

Running the example evaluates each algorithm in turn and reports the mean and standard deviation F2-Measure.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that none of the tested models have an F2-measure above the baseline of predicting the minority class in all cases (0.682).

None of the models are skillful.

This is surprising, although it suggests that perhaps the decision boundary between the two classes is noisy.

A figure is created showing one box and whisker plot for each algorithm’s sample of results.

The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.

Box and Whisker Plot of Machine Learning Models on the Imbalanced German Credit Dataset

Now that we have some results, let’s see if we can improve them with some undersampling.

Undersampling is perhaps the least widely used technique when addressing an imbalanced classification task, as most of the focus is put on oversampling the minority class with SMOTE.

Undersampling can help to remove examples from the majority class along the decision boundary that make the problem challenging for classification algorithms.

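In this experiment, we will test five undersampling algorithms: Tomek Links (TL), Edited Nearest Neighbors (ENN), Repeated Edited Nearest Neighbors (RENN), One-Sided Selection (OSS), and the Neighborhood Cleaning Rule (NCR).

The Tomek Links and ENN methods select examples from the majority class to delete, whereas OSS and NCR both select examples to keep and examples to delete.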

We will use the balanced version of the logistic regression algorithm to test each undersampling method, to keep things simple.

The get_models() function from the previous section can be updated to return a list of undersampling techniques to test with the logistic regression algorithm.

We use the implementations of these algorithms from the imbalanced-learn library.

The updated version of the get_models() function defining the undersampling methods is listed below.

The Pipeline provided by scikit-learn does not know about undersampling algorithms.

Therefore, we must use the Pipeline implementation provided by the imbalanced-learn library.

As in the previous section, the first step of the pipeline will be one hot encoding of categorical variables and normalization of numerical variables, and the final step will be fitting the model.

Here, the middle step will be the undersampling technique, correctly applied within the cross-validation evaluation on the training dataset only.

We would expect the undersampling to result in a lift in skill for logistic regression, ideally above the baseline performance of predicting the minority class in all cases.

Tying this together, the complete example of evaluating logistic regression with different undersampling methods on the German credit dataset is listed below.

Running the example evaluates the logistic regression algorithm with five different undersampling techniques.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that three of the five undersampling techniques resulted in an F2-measure that provides an improvement over the baseline of 0.682.

Specifically, these are ENN, RENN, and NCR, with Repeated ENN resulting in the best performance with an F2-measure of about 0.716.

Box and whisker plots are created for each evaluated undersampling technique, showing that they generally have the same spread.

It is encouraging to see that for the well-performing methods, the boxes spread up to around 0.8, and the mean and median for all three methods are around 0.7.

This highlights that the distributions are skewing high and are let down on occasion by a few bad evaluations.

Box and Whisker Plot of Logistic Regression With Undersampling on the Imbalanced German Credit Dataset

Next, let’s see how we might use a final model to make predictions on new data.

This is a new section that provides a minor departure from the section above.

Here, we will test specific models that result in a further lift in F2-measure performance and I will update this section as new models are reported/discovered.

An F2-measure of about 0.727 can be achieved using balanced Logistic Regression with InstanceHardnessThreshold undersampling.

The complete example is listed below.

Running the example gives the following results; your results may vary given the stochastic nature of the learning algorithm.

An F2-measure of about 0.730 can be achieved using LDA with SMOTEENN, where the ENN parameter is set to an ENN instance with sampling_strategy set to majority.

The complete example is listed below.

Running the example gives the following results; your results may vary given the stochastic nature of the learning algorithm.

An F2-measure of about 0.741 can be achieved with further improvements to the SMOTEENN approach, using a RidgeClassifier instead of LDA and a StandardScaler for the numeric inputs instead of a MinMaxScaler.

The complete example is listed below.

Running the example gives the following results; your results may vary given the stochastic nature of the learning algorithm.

Can you do even better? Let me know in the comments below.

Given the variance in results, a selection of any of the undersampling methods is probably sufficient.

In this case, we will select logistic regression with Repeated ENN.

This model had an F2-measure of about 0.716 on our test harness.

We will use this as our final model and use it to make predictions on new data.

First, we can define the model as a pipeline.

Once defined, we can fit it on the entire training dataset.

Once fit, we can use it to make predictions for new data by calling the predict() function.

This will return the class label of 0 for “good customer”, or 1 for “bad customer”.

Importantly, we must use the ColumnTransformer that was fit on the training dataset in the Pipeline to correctly prepare new data using the same transforms.

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know whether the customer is good or bad.

The complete example is listed below.

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label of a good customer for cases chosen from the dataset file.

We can see that most cases are correctly predicted.

This highlights that although we chose a good model, it is not perfect.

Then some cases of actual bad customers are used as input to the model and the label is predicted.

As we might have hoped, the correct labels are predicted for all cases.


In this tutorial, you discovered how to develop and evaluate a model for the imbalanced German credit classification dataset.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.
