An Introduction to Evaluating Classification Models

Let’s look at the actual distribution of outcomes.

[Figure: Target Class Distribution of Full Dataset]

    sns.countplot(x=fraud['Class'])

Here the class imbalance, or the skewed distribution of outcomes, is obvious.

A quick check using the .value_counts() function shows that we have 137 fraud cases and 1339 non-fraud cases in our data, meaning that 9.3% of our test cases are fraudulent.
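As a quick sanity check on that arithmetic, here is a minimal sketch that rebuilds the label column from the two counts quoted above and computes each class's share. The `labels` list is a stand-in for the real target column; with a pandas Series, `value_counts(normalize=True)` should give the same proportions.

```python
from collections import Counter

# Labels rebuilt from the counts quoted above (1 = fraud, 0 = non-fraud).
labels = [1] * 137 + [0] * 1339

counts = Counter(labels)
shares = {cls: n / len(labels) for cls, n in counts.items()}

print(counts[1], counts[0])
print(round(shares[1] * 100, 1), round(shares[0] * 100, 1))
```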

If we were to apply a dummy model that blindly predicts every observation to be a non-fraud case, as can be done with the sklearn.dummy package, we can see that our reported accuracy of 90.7% matches the share of the majority class in our frequency distribution (90.7%).



So, even though our dummy model is quite useless, we can still report that it achieves roughly 90% accuracy.
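To make this concrete, here is a small sketch of a most-frequent dummy classifier. It assumes synthetic labels with the same class counts as our test set and a placeholder feature matrix (`X_demo`, `y_demo` are illustrative names, not the notebook's variables); the most-frequent strategy ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels with the same class counts as the test set described above
# (1 = fraud, 0 = non-fraud). The features are placeholders, since the
# most-frequent strategy never looks at them.
y_demo = np.array([1] * 137 + [0] * 1339)
X_demo = np.zeros((len(y_demo), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)
dummy_pred = dummy.predict(X_demo)

dummy_accuracy = accuracy_score(y_demo, dummy_pred)
dummy_cm = confusion_matrix(y_demo, dummy_pred)  # rows: actual, columns: predicted

print(round(dummy_accuracy, 3))
print(dummy_cm)
```

The second column of the matrix is all zeros: the model never predicts fraud, so every one of the 137 fraud cases is missed despite the high accuracy.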

The confusion matrix below shows this inability to distinguish between the two classes in further detail.


    from sklearn.dummy import DummyClassifier
    y_test.value_counts() / len(y_test)
    dum = DummyClassifier(strategy='most_frequent')

[Figure: Confusion Matrix from Dummy Classifier]

What measures can we use instead?

The biggest perk of using the confusion matrix is that we can easily pull out various other values that reflect how well our model runs relative to what we’d expect from a dummy model.

Sensitivity (or Recall, or True Positive Rate)

As we saw with the above example, a good model should detect close to all of the actual fraudulent cases, since the cost of missing a fraud case is much higher than the cost of putting a legitimate transaction under scrutiny by incorrectly flagging it as fraud.

Sensitivity, also known as recall, quantifies that intuition, and reflects the ratio of correctly classified positives to actual positive cases.

    Sensitivity = TP / (TP + FN)

Interpretation of sensitivity is fairly straightforward.

All values range between 0 and 1, where a value of 1 indicates that the model detected every single case of fraud, while a value of 0 indicates that none of the actual cases of fraud were detected.

With our logistic regression model, we have a sensitivity of 108 / 137 = 0.788.
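The calculation can be sketched in a few lines, using the counts quoted in the text (108 detected out of 137 actual fraud cases). The helper name `sensitivity` is ours, chosen for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of actual positives that the model caught: TP / (TP + FN)."""
    actual_positives = tp + fn
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined without any actual positives")
    return tp / actual_positives

# Counts quoted in the text: 108 of the 137 actual fraud cases were detected.
tp, fn = 108, 137 - 108
logit_sensitivity = sensitivity(tp, fn)
dummy_sensitivity = sensitivity(0, 137)  # the dummy model never predicts fraud

print(round(logit_sensitivity, 3), dummy_sensitivity)
```

Unlike accuracy, this metric immediately separates the two models: the dummy classifier scores 0.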


Specificity

Specificity tells us how many cases were correctly classified as non-fraud out of all true non-fraud cases.

The false positive rate, or the false alarm rate, is the complement of specificity.

In our fraud detection model, we have a specificity of 1334/1339 = 0.996.

    Specificity = TN / (TN + FP)
    False positive rate = 1 - Specificity

In this particular case, specificity is a less relevant metric, as the cost of classifying a non-fraud case as a fraud case is lower than that of missing a fraud case entirely.

But there are cases where false alarms are equally undesirable, such as in disease detection, where misdiagnosis would lead to unnecessary follow-up procedures.
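As a rough sketch, both quantities can be read straight off a confusion matrix. The matrix below is assembled from the counts quoted in this post (1334 of 1339 non-fraud cases and 108 of 137 fraud cases classified correctly), laid out in the order scikit-learn uses for labels (0, 1).

```python
# Confusion matrix laid out as [[TN, FP], [FN, TP]], the same row/column
# order scikit-learn uses for labels (0, 1). Counts are those quoted in the text.
matrix = [[1334, 5],
          [29, 108]]

(tn, fp), (fn, tp) = matrix

specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)

print(round(specificity, 3))
print(round(false_positive_rate, 3))
```

The two values always sum to 1, which is why the ROC curve's x-axis can be labelled either as the false positive rate or as 1 - specificity.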

Precision

On the other hand, we may want to test the certainty of our predictions. For example, we may be interested in how many of the fraud cases that our model says it picked up were truly fraudulent.

Precision does just that, by providing the proportion of true positives relative to the number of predicted positives.

Intuitively, a low precision would mean that we’re giving a lot of customers headaches, in that we’re flagging more transactions as fraudulent than are actually fraudulent.
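Here is a minimal sketch of the calculation, using the counts reported for the logistic regression model in this post (108 true positives among 113 predicted frauds). The helper `precision` is an illustrative name, and returning None for a model that makes no positive predictions is our own choice for the sketch.

```python
from typing import Optional

def precision(tp: int, fp: int) -> Optional[float]:
    """TP / (TP + FP), or None when the model made no positive predictions."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return None
    return tp / predicted_positives

logit_precision = precision(tp=108, fp=5)  # 113 transactions flagged, 108 truly fraud
dummy_precision = precision(tp=0, fp=0)    # the dummy model flags nothing

print(round(logit_precision, 3), dummy_precision)
```

Note that precision is undefined for the dummy model, since it never flags a transaction in the first place.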

With the logistic regression model, our precision is 108/113 = 0.956.

    Precision = TP / (TP + FP)

Should we maximize specificity or sensitivity?

In our fraud dataset, it’s more important that we minimize the number of fraud cases that go undetected, even if it comes at the cost of incorrectly classifying non-fraud cases as fraudulent, simply because the cost to the firm is far higher in the former case (potentially thousands of dollars in lost revenue, versus a few minutes of a customer’s time to verify their transaction).

In other words, we would rather commit a type I error than a type II error.

Although we would ideally have a model that maximizes both sensitivity and specificity, here we would prefer a model with maximum sensitivity, as it minimizes the occurrence of type II errors.

F1 Score

Last but not least, the F1 score summarizes both precision and recall, and can be understood as the harmonic mean of the two measures.

An F1 score of 1 indicates perfect precision and recall, so the higher the F1 score, the better the model.
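The harmonic mean can be sketched as follows, using the confusion-matrix counts quoted in this post (TP = 108, FP = 5, FN = 29). The helper `f1_from_rates` and the lopsided example values are ours, added only to show how the harmonic mean punishes a large gap between precision and recall.

```python
def f1_from_rates(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Counts quoted in the text for the logistic regression model.
tp, fp, fn = 108, 5, 29
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f1_from_rates(precision, recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)  # algebraically the same quantity

# Hypothetical lopsided model: perfect precision but only 10% recall.
lopsided = f1_from_rates(1.0, 0.1)

print(round(f1, 3), round(lopsided, 3))
```

An arithmetic mean would score the lopsided model at 0.55; the harmonic mean pulls it down to roughly 0.18, much closer to the weaker of the two measures.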

The logistic model here shows an F1 score of 0.864.

    F1 = 2 * (Precision * Sensitivity) / (Precision + Sensitivity)

What happens when we change the probability cutoff?

How exactly are we coming up with our predictions? From our logistic regression, we compute a predicted probability of a given observation being fraudulent, which falls between 0 and 1.

We say that all probabilities greater than 0.5 should indicate a prediction of fraud, while all values less than 0.5 return a prediction of a legitimate transaction.
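In code, that decision rule is a one-line comparison. The sketch below uses made-up probabilities and an illustrative helper named `classify`; with a fitted scikit-learn logistic regression, the fraud probabilities would typically come from the second column of `predict_proba`, which is why the snippets in this post index with `[:,1]`.

```python
def classify(probabilities, threshold=0.5):
    """Label a transaction as fraud (1) when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted fraud probabilities, for illustration only.
example_probs = [0.92, 0.51, 0.49, 0.08]
labels = classify(example_probs)

print(labels)  # [1, 1, 0, 0]
```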

But given how much more willing we are to make a Type I error, wouldn’t it be better to classify a case as fraudulent even if there’s only a slight probability that it is? In other words, what if we lower our threshold for discriminating between fraud and non-fraud, shifting it down from the orange line to the green line, so that we catch more frauds?

[Figure: Orange = Discrimination Threshold of 0.5; Green = Discrimination Threshold of 0.1]

From the definitions we discussed previously, the sensitivity of the model would increase, as we’ve now classified more cases as positive, but at the same time we increase the likelihood of incorrectly labelling a non-fraudulent case as fraud, thus dropping the specificity as the number of false positives increases.

It’s a constant trade-off, and the rate at which the sensitivity increases from dropping specificity is an attribute specific to each model.
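The trade-off is easy to see on a toy example. The labels and probabilities below are hypothetical, not the post's data, and `sensitivity_specificity` is an illustrative helper; the sketch simply recomputes both metrics at the two thresholds discussed here.

```python
def sensitivity_specificity(y_true, probs, threshold):
    """Return (sensitivity, specificity) when predicting fraud for probs > threshold."""
    tp = fn = tn = fp = 0
    for actual, p in zip(y_true, probs):
        predicted = 1 if p > threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels and predicted fraud probabilities, not the post's data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.15, 0.4, 0.2, 0.12, 0.05, 0.03, 0.01]

at_half = sensitivity_specificity(y_true, probs, 0.5)
at_tenth = sensitivity_specificity(y_true, probs, 0.1)

print(at_half, at_tenth)
```

On this toy data, dropping the threshold from 0.5 to 0.1 doubles the sensitivity (0.5 to 1.0) while halving the specificity (1.0 to 0.5).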

Thus far, we’ve been reporting our metrics from a confusion matrix calculated at a threshold of 0.5 (the orange line), but if we were to lower the discrimination threshold to 0.1 (the green line), the sensitivity increases from 0.788 to 0.883, while the specificity drops from its earlier value of 0.996.


    classification_report(y_test, y_pred)

ROC Curve

We can see the change in these values at all thresholds using a Receiver Operating Characteristic, or ROC, curve, which plots the true positive rate against the false positive rate (that is, sensitivity against 1 - specificity) for each threshold.

We can see here the ROC curve for our logistic classifier, in which the different thresholds are applied to the predicted probabilities to produce different true positive and false positive rates.

Our dummy model is a point at (0,0) on the blue curve, where the discrimination threshold is such that any probability < 1.0 is predicted as a non-fraud case.

This indicates that though we correctly classified all of the non-fraud cases, we incorrectly classified all of the fraud cases.

A perfectly discriminating model, on the other hand, would have a point on its curve at (0,1), which would indicate that the model perfectly classifies all fraud cases as well as all non-fraud cases.

The lines for both of these models would be generated as a function of the threshold level.
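These two reference points can be checked with a short sketch. It rebuilds labels from the class counts quoted earlier and computes the (FPR, TPR) pair for a model that never predicts fraud and for one that gets every label right; `roc_point` is an illustrative helper, not part of any library.

```python
def roc_point(y_true, y_pred):
    """Return (FPR, TPR) for one set of hard predictions."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Labels rebuilt from the class counts quoted earlier (1 = fraud, 0 = non-fraud).
y_true = [1] * 137 + [0] * 1339

dummy_point = roc_point(y_true, [0] * len(y_true))  # never predicts fraud
perfect_point = roc_point(y_true, list(y_true))     # every label correct

print(dummy_point, perfect_point)
```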

    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:,1])

The diagonal line shows where the true positive rate is the same as the false positive rate, where there is an equal chance of correctly detecting a fraud case and of flagging a non-fraud case as fraudulent.

This means that any ROC curve above the diagonal does better than random guessing at predicting outcomes.

Thus, the ROC curve of any useful model will lie above the y = x line, and below the path going vertically up to (0,1), then horizontally across to (1,1).

We can quantify the degree to which a ROC curve performs by looking at the area under the curve, or AUC ROC.

This value will fall between 0.5 (for the diagonal line) and 1.0 (for a perfect model).

Our fraud detection model performs with an AUC ROC of 0.934, not bad for an out-of-the-box model.
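As a rough illustration of how the curve and its area fit together, the sketch below runs scikit-learn's `roc_curve`, `auc` and `roc_auc_score` on a small set of hypothetical labels and probabilities (not the post's data, so the resulting area differs from our model's 0.934).

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted fraud probabilities, not the post's data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.15, 0.4, 0.2, 0.12, 0.05, 0.03, 0.01]

# One (FPR, TPR) point per candidate threshold, running from (0, 0) to (1, 1).
fpr, tpr, thresholds = roc_curve(y_true, probs)

area_from_curve = auc(fpr, tpr)             # trapezoidal area under the plotted points
area_direct = roc_auc_score(y_true, probs)  # same quantity, computed from the scores

print(round(area_direct, 3))
```

Both routes give the same number, so `roc_auc_score` is usually the more convenient call when only the summary value is needed.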

    from sklearn.metrics import roc_auc_score
    auc = roc_auc_score(y_test, logis_pred_prob[:,1])

PRC Curve

Going back to the logit model, what happens to precision and recall when we shift our discrimination threshold from the orange line to the green line? As we move from a threshold of 0.5 to 0.1, the recall increases from 0.788 to 0.883, since the proportion of correctly detected frauds over the actual number of frauds increases, while the precision drops from 0.956 to 0.338, as the proportion of true fraud cases over the predicted number of fraud cases decreases.

Just as we did with the ROC curve, we can plot the trade-off between precision and recall as a function of the thresholds, and obtain a Precision-Recall Curve, or PRC.

    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob[:,1])

Typically, PRCs are better suited for models trained on highly imbalanced datasets, as the high true negative count used in the formulation of the ROC curve’s false positive rate can ‘inflate’ the perception of how well the model performs.

PRCs avoid this value, and can thus provide a less biased picture of the model’s performance.

We can summarize this curve succinctly using an average precision value or average F1 score (averaged across each threshold), with an ideal value close to 1.
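Here is a small sketch of the curve and its average-precision summary, again on hypothetical labels and probabilities rather than the post's data. It uses scikit-learn's `precision_recall_curve` and `average_precision_score`, both of which take the predicted probability of the positive class.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical labels and predicted fraud probabilities, not the post's data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.15, 0.4, 0.2, 0.12, 0.05, 0.03, 0.01]

# One (precision, recall) pair per threshold, plus a final (1, 0) end point.
precision, recall, thresholds = precision_recall_curve(y_true, probs)

# Average precision: precision at each threshold, weighted by the gain in recall.
ap = average_precision_score(y_true, probs)

print(round(ap, 3))
```

Neither metric involves the true negative count, so adding more easy-to-classify legitimate transactions to this toy set would leave the summary unchanged.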

    from sklearn.metrics import f1_score
    from sklearn.metrics import average_precision_score
    # f1_score expects hard class labels; average_precision_score expects
    # the predicted probability of the positive class
    f1 = f1_score(y_test, y_pred)
    ap = average_precision_score(y_test, y_pred_prob[:,1])

Parting Thoughts

Quick review of formulas:

[Figure: formula summary. TPR = True Positive Rate; FPR = False Positive Rate]

In this post, we discussed how to move past accuracy as a measure of performance for a binary classifier.

We discussed sensitivity (recall), specificity, precision and the F1 score.

We looked at what happens when we change our decision threshold, and how we can visualize the trade-off between sensitivity and specificity, as well as between precision and recall.

As a reminder, all the visualizations and models produced can be found in the interactive Python notebook, linked below.

com/ishaandey/Classification_Evaluation_Walkthrough

That’s it for this post.

Let us know if you have any questions!
