Why measuring accuracy is hard (and very important)!

We find some acceptable point on the ROC curve, set the threshold to that point, and use these two new numbers to describe its outputs.

But this quickly gets complicated when you have multiple classes in your outputs: we now have a whole series of precisions and recalls, and it becomes bewildering to try to explain how a model behaves and performs.
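To make that "whole series of precisions and recalls" concrete, here is a minimal sketch in plain Python. The function name `per_class_precision_recall` and the toy labels are my own illustrative assumptions, not taken from any particular library or dataset.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # how often each label is the right answer
    pred_count = Counter(y_pred)   # how often each label was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    result = {}
    for label in labels:
        # Precision: of everything predicted as `label`, how much was right?
        precision = correct[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: of everything that really was `label`, how much did we find?
        recall = correct[label] / true_count[label] if true_count[label] else 0.0
        result[label] = (precision, recall)
    return result

# Hypothetical toy data with three classes.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

scores = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Three classes already give six numbers to report, and none of them alone tells you whether the model is "good".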

The measurement of accuracy can even be affected by raw mathematics, which can make an improvement actually look worse.

This example came to me when I was reading the Google AI blog article for a diabetes detection algorithm: http://ai.…html

They mention that they had switched their algorithm from doing a binary classification (diabetic / not diabetic) to one on a 5 point severity scale.

The data being analyzed was the same; only the outputs had changed.

But now think about this: On a binary classification, the system can guess totally randomly and has a 50–50 chance of getting the correct answer.

On a 5 point grading system, the system only has a 20% chance of getting the correct answer when it guesses randomly.

So now imagine you switch from a 2 point to 5 point grading system in order to give your model richer information to learn from.

If you measure raw accuracy, however, you might notice that the switch reduced it.

In the 5 point grading system, the system has a much lower chance of getting the correct answer by chance alone.

The authors of the article get around this by using the Kappa score, which adjusts for the chance of a randomly correct answer, rather than just vanilla accuracy.
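Here is a rough sketch of how a chance-corrected score changes the picture. It assumes plain (unweighted) Cohen's kappa, which may not be the exact variant the article's authors used, and the labels below are invented purely for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: (observed - chance) / (1 - chance)."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    # Agreement expected if predictions were independent of the truth
    # but kept the same label frequencies.
    p_chance = sum(true_freq[k] * pred_freq[k] for k in true_freq) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical binary model: 7 of 10 correct, chance agreement is 0.5.
binary_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
binary_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Hypothetical 5 point model: only 6 of 10 correct, chance agreement is 0.2.
five_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
five_pred = [0, 0, 1, 2, 2, 1, 3, 4, 4, 3]

print(accuracy(binary_true, binary_pred), cohen_kappa(binary_true, binary_pred))
print(accuracy(five_true, five_pred), cohen_kappa(five_true, five_pred))
```

In this made-up case raw accuracy drops from 0.7 to 0.6, while kappa rises from about 0.4 to about 0.5, because being right on a 5 point scale is much harder to do by luck.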

But the basic problem gets to the heart of my topic in this article.

Measuring your accuracy just isn’t as straightforward as it might seem.

And how we measure accuracy will alter what changes we make to our model to improve it, how much we trust a model, and most importantly, how stakeholders like business people, engineers, government, healthcare, or social services organizations adopt, integrate and use these algorithms.

Why measuring accuracy correctly is important

Why measure accuracy? This might seem to be an easy-to-answer question.

Without measuring accuracy, there is no way to know if your model is working or not.

Unlike regular code, which can be tested with a prior assumption that it works perfectly, 100% of the time as designed, machine learning code is expected to fail on some number of samples.

So measuring that exact number of failures is the key to testing a machine learning system.

But I want to take a moment to touch on some of the reasons that we measure accuracy and why it becomes important to do it right.

Improving your Model

The first and easiest-to-understand reason you measure the accuracy of a model is so that you can improve its accuracy.

When you are trying to improve the accuracy, almost any metric of accuracy will do.

As long as that metric has a clearly defined notion of better or worse, the exact value of the metric doesn’t matter.

What you care about is whether the metric is improving or getting worse.

So what can go wrong? Well, if you measure the accuracy of your model incorrectly, you could actually be modifying your model in ways that hurt your real-world performance while appearing to improve your metric.

Take, for example, the problem of generalization vs overfitting.

If you are measuring your accuracy incorrectly, you could be making changes that appear to improve your metric, but instead they are just making your model overfit the data your metric is measuring against.

The standard way of solving this problem is to break your data 80/20 into training / testing.

But this is also fraught with difficulty, since we sometimes use measurements of accuracy to do things like early-stopping or setting confidence thresholds, so the testing data itself becomes part of your training process.

You could be over-fitting your confidence thresholds.

So then you decide to break your data three ways, 70/20/10, with an extra validation set you can measure your accuracy with at the end.
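As a minimal sketch of that three-way split, assuming the samples can simply be shuffled (no time ordering, no grouping by patient or user): the function name `three_way_split`, the default fractions, and the name `final_holdout` for the 10% set are my own illustrative choices.

```python
import random

def three_way_split(samples, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle once, then cut into train / test / final holdout."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(samples)
    # A fixed seed keeps the split identical across experiments, so the
    # holdout set never leaks into training between runs.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_test = round(n * fractions[1])

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Whatever is left is only looked at once, at the very end.
    final_holdout = shuffled[n_train + n_test:]
    return train, test, final_holdout

samples = list(range(100))  # stand-in for real (input, label) pairs
train, test, final_holdout = three_way_split(samples)
print(len(train), len(test), len(final_holdout))
```

The split only protects you if you resist tuning anything against the last set, including thresholds and stopping points.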

But now what if your dataset size is relatively small, or it’s not perfectly representative of the real world data it has to operate on?

You now have to worry about another type of over-fitting I call architectural overfit, where the design and parameters of your model become too perfectly tailored to the dataset, so the model does not generalize to new samples or learn them very well when they are added to the dataset.

This can happen, for example, if you prepare a lot of custom-made features based on your dataset, only to find they don’t apply when the dataset grows over time, is modified significantly, or is merged with some other dataset.

You got excellent training, testing, and validation accuracy.

But you’re still overfitting the dataset.

Now what if your dataset has noise? What if there are consistent mistakes in the data? You might think your model is amazing, and in one sense it is: it has perfectly learned the consistent patterns of mistakes in the dataset.

You gleefully put the model into the next product, only to have your ass handed to you when the product is launched.

It seems your model does not work so great in the real world.

“It was perfectly accurate in testing,” you might think. “What could have gone wrong?”

What if your accuracy measurement itself has noise? Let’s say you have a 1–3% spread in accuracy across multiple runs.

Now this makes it harder to make incremental improvements.

Every improvement needs to be larger than 2–3% in order for you to be able to reliably confirm it with a single run.

Either you spend more CPU power to get clear answers using averages, or you risk spinning in circles looking only for big wins and forgoing incremental improvement.
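To sketch what "using averages" can look like: the code below compares two sets of repeated runs using the mean and its standard error. The accuracy numbers are made up, the helper names are hypothetical, and the "difference larger than two combined standard errors" rule is a rough heuristic I am assuming here, not a formal statistical test.

```python
import math
import statistics

def summarize_runs(accuracies):
    """Mean accuracy and standard error of that mean (needs 2+ runs)."""
    mean = statistics.mean(accuracies)
    stderr = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, stderr

def looks_like_real_improvement(baseline_runs, candidate_runs, z=2.0):
    """Heuristic: is the gain bigger than z combined standard errors?"""
    base_mean, base_se = summarize_runs(baseline_runs)
    cand_mean, cand_se = summarize_runs(candidate_runs)
    combined_se = math.sqrt(base_se ** 2 + cand_se ** 2)
    return (cand_mean - base_mean) > z * combined_se

# Hypothetical accuracies from five training runs each.
baseline = [0.90, 0.92, 0.91, 0.89, 0.93]
small_gain = [0.92, 0.93, 0.91, 0.94, 0.92]   # mean is 1.4 points higher
large_gain = [0.95, 0.96, 0.94, 0.97, 0.95]   # mean is 4.4 points higher

print(looks_like_real_improvement(baseline, small_gain))
print(looks_like_real_improvement(baseline, large_gain))
```

With this much run-to-run spread, five runs per variant are not enough to confirm the small gain under this heuristic, while the large one stands out. More runs shrink the standard error, which is exactly the CPU cost described above.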

Measuring your accuracy better means that when you make changes to your model, you can be confident as to whether those changes are leading to a better model in the ways you care about.

When you measure your accuracy wrong, you can end up tearing up changes or going back to the drawing board because that model you “thought” was 99.9% accurate doesn’t actually work anywhere close to that in production.

Better measurement of accuracy means faster research, better products, and more accolades for you.

It can even save lives.

Communicating to Stakeholders that use our Models

Another reason we measure the accuracy of the model is so that we can communicate to stakeholders and they can use our models.

Models are never just pieces of math and code. They must sit and operate in the real world, having real effects on real people’s lives.

If a doctor is going to use an algorithm to make medical decisions, it’s important for them to know that the algorithm could be wrong, and how often that is the case.

If a company is going to replace a team of data entry people with a computer, it’s important for them to know how often it can make mistakes because that can affect the company’s processes.

If we claim a model will only make mistakes 3% of the time, but it actually makes them 5% of the time, we might brush that off as a small difference.

But that could represent a roughly 67% increase (5 / 3 ≈ 1.67) in calls being made to the support department by all the people affected by the algorithm’s mistakes.

That massive increase in cost could completely nullify any benefits of implementing the algorithm in the first place.

Stakeholders need to understand the accuracy of an algorithm, and its typical failure cases, because accuracy has real world implications.

The accuracy could affect budgets and balance sheets, the lives and health of real people, and even the outcome of our democracy (when it comes to fact checking algorithms now used by journalists).

It could make or break new AI products, and create real disconnects between the engineers who create the technology and the consumers who use it.

Making mistakes in the measurement of accuracy could very well mean that lives are lost and new products fail.

Why Measuring Accuracy is Hard

So what is it about measuring accuracy that makes it so difficult? Why has a seemingly easy question to answer become so difficult?

In part 2 of this series, we will go over some of the common problems that show up in measuring accuracy:

The data you’re training the algorithm on is not the same as the data it’s expected to work on in production
You care about certain types of failures more than other types
Your model, dataset or measurement may be inherently noisy or stochastic
Your pipeline may have multiple different points at which accuracy can be measured
Your model may have several different metrics with different levels of granularity
You might only have ground-truth data for an intermediate step in the system, but not the final result
Your dataset might be broken down into different categories with wildly different performance between them

In part 3 of this series, we will address some of the more difficult and interesting problems that we face when measuring accuracy:

You may not have any ground-truth data for your model
There may be no off-the-shelf metric which measures accuracy for your model
There may not be a clear way to define accuracy for your model
The outcome you actually care about can’t be measured easily
Measuring accuracy effectively is too computationally expensive
Your algorithm might be working in tandem with humans
Your dataset might be changing and evolving, for example if it’s being actively grown by a data annotation team
Your problem space might be continuously evolving over time

In light of all of these potential problems in measuring accuracy, I’ve come to appreciate a basic piece of wisdom: no matter how you measure it or what metric you use, we can usually agree on what perfectly correct and completely wrong look like.

It’s everything in between that matters.

Look for Part 2 of this series coming out soon!

Originally published at www.



