Developing a probabilistic model is challenging in general, although it is made more so when there is skew in the distribution of cases, referred to as an imbalanced dataset.

The Haberman Dataset describes the five year or greater survival of breast cancer patient patients in the 1950s and 1960s and mostly contains patients that survive.

This standard machine learning dataset can be used as the basis of developing a probabilistic model that predicts the probability of survival of a patient given a few details of their case.

Given the skewed distribution in cases in the dataset, careful attention must be paid to both the choice of predictive models to ensure that calibrated probabilities are predicted, and to the choice of model evaluation to ensure that the models are selected based on the skill of their predicted probabilities rather than crisp survival vs.

non-survival class labels.

In this tutorial, you will discover how to develop a model to predict the probability of patient survival on an imbalanced dataset.

After completing this tutorial, you will know:Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

How to Develop a Probabilistic Model of Breast Cancer Patient SurvivalPhoto by Tanja-Milfoil, some rights reserved.

This tutorial is divided into five parts; they are:In this project, we will use a small breast cancer survival dataset, referred to generally as the “Haberman Dataset.

”The dataset describes breast cancer patient data and the outcome is patient survival.

Specifically whether the patient survived for five years or longer, or whether the patient did not survive.

This is a standard dataset used in the study of imbalanced classification.

According to the dataset description, the operations were conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital.

There are 306 examples in the dataset, and there are 3 input variables; they are:As such, we have no control over the selection of cases that make up the dataset or features to use in those cases, other than what is available in the dataset.

Although the dataset describes breast cancer patient survival, given the small dataset size and the fact the data is based on breast cancer diagnosis and operations many decades ago, any models built on this dataset are not expected to generalize.

To be crystal clear, we are not “solving breast cancer.

” We are exploring a standard imbalanced classification dataset.

You can learn more about the dataset here:We will choose to frame this dataset as the prediction of a probability of patient survival.

That is:Given patient breast cancer surgery details, what is the probability of survival of the patient to five years or more?This will provide the basis for exploring probabilistic algorithms that can predict a probability instead of a class label and metrics for evaluating models that predict probabilities instead of class labels.

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-CourseFirst, download the dataset and save it in your current working directory with the name “haberman.

csv“.

Review the contents of the file.

The first few lines of the file should look as follows:We can see that the patients have an age like 30 or 31 (column 1), that operations occurred in years like 64 and 62 for 1964 and 1962 respectively (column 2), and that “axillary nodes” has values like 1 and 0.

All values are numeric; specifically, they are integer.

There are no missing values marked with a “?” character.

We can also see that the class label (column 3) has a value of either 1 for patient survival and 2 for patient non-survival.

Firstly, we can load the CSV dataset and summarize each column using a five-number summary.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the names of the columns as there is no header line.

We can then call the describe() function to create a report of the five-number summary of each column and print the contents of the report.

A five-number summary for a column includes useful details like the min and max values, the mean and standard deviation of which are useful if the variable has a Gaussian distribution, and the 25th, 50th, and 75th quartiles, which are useful if the variable does not have a Gaussian distribution.

Tying this together, the complete example of loading and summarizing the dataset columns is listed below.

Running the example loads the dataset and reports a five-number summary for each of the three input variables and the output variable.

Looking at the age, we can see that the youngest patient was 30 and the oldest was 83; that is quite a spread.

The mean patient age was about 52 years.

If the occurrence of cancer is somewhat random, we might expect this distribution to be Gaussian.

We can see that all operations were performed between 1958 and 1969.

If the number of breast cancer patients is somewhat fixed over time, we might expect this variable to have a uniform distribution.

We can see nodes have values between 0 and 52.

This might be a cancer diagnostic related to lymphatic nodes.

All variables are integers.

Therefore, it might be helpful to look at each variable as a histogram to get an idea of the variable distribution.

This might be helpful in case we choose models later that are sensitive to the data distribution or scale of the data, in which case, we might need to transform or rescale the data.

We can create a histogram of each variable in the DataFrame by calling the hist() function.

The complete example is listed below.

Running the example creates a histogram for each variable.

We can see that age appears to have a Gaussian distribution, as we might have expected.

We can also see that year has a uniform distribution, mostly, with an outlier in the first year showing nearly double the number of operations.

We can see nodes has an exponential type distribution with perhaps most examples showing 0 nodes, with a long tail of values after that.

A transform to un-bunch this distribution might help some models later on.

Finally, we can see the two-class values with an unequal class distribution, showing perhaps 2- or 3-times more survival than non-survival cases.

Histogram of Each Variable in the Haberman Breast Cancer Survival DatasetIt may be helpful to know how imbalanced the dataset actually is.

We can use the Counter object to count the number of examples in each class, then use those counts to summarize the distribution.

The complete example is listed below.

Running the example summarizes the class distribution for the dataset.

We can see that class 1 for survival has the most examples at 225, or about 74 percent of the dataset.

We can see class 2 for non-survival has fewer examples at 81, or about 26 percent of the dataset.

The class distribution is skewed, but it is not severely imbalanced.

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split.

We will use k=10, meaning each fold will contain 306/10 or about 30 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 74 percent to 26 percent survival and non-survival.

Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model.

We will use three repeats.

This means a single model will be fit and evaluated 10 * 3, or 30, times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.

Given that we are interested in predicting a probability of survival, we need a performance metric that evaluates the skill of a model based on the predicted probabilities.

In this case, we will the Brier score that calculates the mean squared error between the predicted probabilities and the expected probabilities.

This can be calculated using the brier_score_loss() scikit-learn function.

This score is minimized, with a perfect score of 0.

0.

We can invert the score to be maximizing by comparing a predicted score to a reference score, showing how much better the model is compared to the reference between 0.

0 for the same, to 1.

0 with perfect skill.

Any models that achieves a score less than 0.

0 represents less skill than the reference model.

This is called the Brier Skill Score, or BSS for short.

It is customary for an imbalanced dataset to model the minority class as a positive class.

In this dataset, the positive class represents non-survival.

This means, that we will be predicting the probability of non-survival and will need to calculate the complement of the predicted probability in order to get the probability of survival.

As such, we can map the 1 class values (survival) to the negative case with a 0 class label, and the 2 class values (non-survival) to the positive case with a class label of 1.

This can be achieved using the LabelEncoder class.

For example, the load_dataset() function below will load the dataset, split the variable columns into input and outputs, and then encode the target variable to 0 and 1 values.

Next, we can calculate the Brier skill score for a model.

First, we need a Brier score for a reference prediction.

A reference prediction for a problem in which we are predicting probabilities is the probability of the positive class label in the dataset.

In this case, the positive class label represents non-survival and occurs about 26% in the dataset.

Therefore, predicting about 0.

26471 represents the worst-case or baseline performance for a predictive model on this dataset.

Any model that has a Brier score better than this has some skill, where as any model that as a Brier score lower than this has no skill.

The Brier Skill Score captures this important relationship.

We can calculate the Brier score for this default prediction strategy automatically for each training set in the k-fold cross-validation process, then use it as a point of comparison for a given model.

The Brier score can then be calculated for the predictions from a model and used in the calculation of the Brier Skill Score.

The brier_skill_score() function below implements this and calculates the Brier Skill Score for a given set of true labels and predictions on the same test set.

Any model that achieves a BSS above 0.

0 means it shows skill on this dataset.

Next, we can make use of the brier_skill_score() function to evaluate a model using repeated stratified k-fold cross-validation.

To use our custom performance metric, we can use the make_scorer() scikit-learn function that takes the name of our custom function and creates a metric that we can use to evaluate models with the scikit-learn API.

We will set the needs_proba argument to True to ensure that models that are evaluated make predictions using the predict_proba() function to ensure they give probabilities instead of class labels.

The evaluate_model() function below defines the evaluation procedure with our custom evaluation metric, taking the entire training dataset and model as input, then returns the sample of scores across each fold and each repeat.

Finally, we can use our three functions and evaluate a model.

First, we can load the dataset and summarize the input and output arrays to confirm they were loaded correctly.

In this case, we will evaluate the baseline strategy of predicting the distribution of positive examples in the training set as the probability of each case in the test set.

This can be implemented automatically using the DummyClassifier class and setting the “strategy” to “prior” that will predict the prior probability of each class in the training dataset, which for the positive class we know is about 0.

26471.

We can then evaluate the model by calling our evaluate_model() function and report the mean and standard deviation of the results.

Tying this all together, the complete example of evaluating the baseline model on the Haberman breast cancer survival dataset using the Brier Skill Score is listed below.

We would expect the baseline model to achieve a BSS of 0.

0, e.

g.

the same as the reference model because it is the reference model.

Running the example first loads the dataset and reports the number of cases correctly as 306 and the distribution of class labels for the negative and positive cases as we expect.

The DummyClassifier with our default strategy is then evaluated using repeated stratified k-fold cross-validation and the mean and standard deviation of the Brier Skill Score is reported as 0.

0.

This is as we expected, as we are using the test harness to evaluate the reference strategy.

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will use the test harness developed in the previous section to evaluate a suite of algorithms and then improvements to those algorithms, such as data preparation schemes.

We will evaluate a suite of models that are known to be effective at predicting probabilities.

Specifically, these are models that are fit under a probabilistic framework and explicitly predict a calibrated probability for each example.

A such, this makes them well-suited to this dataset, even with the class imbalance.

We will evaluate the following six probabilistic models implemented with the scikit-learn library:We are interested in directly comparing the results from each of these algorithms.

We will compare each algorithm based on the mean score, as well as based on their distribution of scores.

We can define a list of the models that we want to evaluate, each with their default configuration or configured as to not produce a warning.

We can then enumerate each model, record a unique name for the model, evaluate it, and report the mean BSS and store the results for the end of the run.

At the end of the run, we can then create a box and whisker plot that shows the distribution of results from each algorithm, where the box shows the 25th, 50th, and 75th percentiles of the scores and the triangle shows the mean result.

The whiskers of each plot give an idea of the extremes of each distribution.

Tying this together, the complete example is listed below.

Running the example first summarizes the mean and standard deviation of the BSS for each algorithm (larger scores is better).

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, the results suggest that only two of the algorithms are not skillful, showing negative scores, and that perhaps the LogisticRegression (LR) and LinearDiscriminantAnalysis (LDA) algorithms are the best performing.

A box and whisker plot is created summarizing the distribution of results.

Interestingly, most if not all algorithms show a spread indicating that they may be unskilful on some of the runs.

The distribution between the two topperforming models appears roughly equivalent, so choosing a model based on mean performance might be a good start.

Box and Whisker Plot of Probabilistic Models on the Haberman Breast Cancer Survival DatasetThis is a good start; let’s see if we can improve the results with basic data preparation.

It can be a good practice to scale data for some algorithms if the variables have different units of measure, as they do in this case.

Algorithms like the LR and LDA are sensitive to the of the data and assume a Gaussian distribution for the input variables, which we don’t have in all cases.

Nevertheless, we can test the algorithms with standardization, where each variable is shifted to a zero mean and unit standard deviation.

We will drop the MultinomialNB algorithm as it does not support negative input values.

We can achieve this by wrapping each model in a Pipeline where the first step is a StandardScaler, which will correctly be fit on the training dataset and applied to the test dataset within each k-fold cross-validation evaluation, preventing any data leakage.

The complete example of evaluating the five remaining algorithms with standardized input data is listed below.

Running the example again summarizes the mean and standard deviation of the BSS for each algorithm.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that the standardization has not had much of an impact on the algorithms, except the Gaussian Process Classifier (GPC).

The performance of the GPC with standardization has shot up and is now the best-performing technique.

This highlights the importance of preparing data to meet the expectations of each model.

Box and whisker plots for each algorithm’s results are created, showing the difference in mean performance (green triangles) and the similar spread in scores between the three top-performing methods.

This suggests all three probabilistic methods are discovering the same general mapping of inputs to probabilities in the dataset.

Box and Whisker Plot of Probabilistic Models With Data Standardization on the Haberman Breast Cancer Survival DatasetThere is further data preparation to make the input variables more Gaussian, such as power transforms.

Power transforms, such as the Box-Cox and Yeo-Johnson transforms, are designed to change the distribution to be more Gaussian.

This will help with the “age” input variable in our dataset and may help with the “nodes” variable and un-bunch the distribution slightly.

We can use the PowerTransformer scikit-learn class to perform the Yeo-Johnson and automatically determine the best parameters to apply based on the dataset, e.

g.

how to best make each variable more Gaussian.

Importantly, this transformer will also standardize the dataset as part of the transform, ensuring we keep the gains seen in the previous section.

The power transform may make use of a log() function, which does not work on zero values.

We have zero values in our dataset, therefore we will scale the dataset prior to the power transform using a MinMaxScaler.

Again, we can use this transform in a Pipeline to ensure it is fit on the training dataset and applied to the train and test datasets correctly, without data leakage.

We will focus on the three top-performing methods, in this case, LR, LDA, and GPC.

The complete example is listed below.

Running the example again summarizes the mean and standard deviation of the BSS for each algorithm.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see a further lift in model skill for the three models that were evaluated.

We can see that the LR appears to have out-performed the other two methods.

Box and whisker plots are created for the results from each algorithm, suggesting perhaps a smaller and more focused spread for LR compared to the LDA, which was the second-best performing method.

All methods still show skill on average, however the distribution of scores show runs that drop below 0.

0 (no skill) in some cases.

Box and Whisker Plot of Probabilistic Models With Power Transforms on the Haberman Breast Cancer Survival DatasetWe will select the Logistic Regression model with a power transform on the input data as our final model.

We can define and fit this model on the entire training dataset.

Once fit, we can use it to make predictions for new data by calling the predict_proba() function.

This will return two probabilities for each prediction, the first for survival and the second for non-survival, e.

g.

its complement.

For example:To demonstrate this, we can use the fit model to make some predictions of probability for a few cases where we know there is survival and a few where we know there is not.

The complete example is listed below.

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the probability of survival for cases where we know the patient survived, chosen from the dataset file.

We can see that for the chosen survival cases, the probability of survival was high, between 77 percent and 86 percent.

Then some cases of non-survival are used as input to the model and the probability of survival is predicted.

As we might have hoped, the probability of survival is modest, hovering around 53 percent to 63 percent.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop a model to predict the probability of patient survival on an imbalanced dataset.

Specifically, you learned:Do you have any questions? Ask your questions in the comments below and I will do my best to answer.

with just a few lines of python codeDiscover how in my new Ebook: Imbalanced Classification with PythonIt provides self-study tutorials and end-to-end projects on: Performance Metrics, Undersampling Methods, SMOTE, Threshold Moving, Probability Calibration, Cost-Sensitive Algorithms and much more.

.