Why Should I Care About Understanding My Model?

On the importance of deeper analysis of non-parametric models, the systems they are a part of, and their effects in the real world.

Abhimanyu Aditya · Mar 25

When parametric met non-parametric. (source: pexels.com)

Non-parametric machine learning (ML) models (e.g. Random Forests, Neural Networks) are highly flexible but complex models that can attain significantly higher accuracy than parametric models such as regression-based methods (e.g. logistic, linear, polynomial, etc.).

They can also be easier to use and more robust, leaving less room for improper use and misunderstanding.

But these advantages have a cost.

Compared to their parametric and often linear cousins, these models do not produce predictions that can be explained, and their structure cannot be directly visualized, i.e. they are not interpretable.

So let me say this upfront:

It is important to understand your model.

Scratch that.

It is necessary to understand your model.

This is not a philosophical point of view.

It is a practical one.

I’ll highlight some key reasons why, with examples.

For Correctness (which is != Accuracy)

Powerful ML models will fit complex patterns in the many nooks and crannies of the data, and it is impossible for the analyst to try to understand these directly, as they are combinatorial in nature.

Still, doing some level of analysis of how the model generally behaves, how it interacts with the data and where the errors lie can help identify problems with the data or with the model, before it starts making predictions in the real world.

Making mistakes in a predictive system that is serving up ads on Facebook is one thing; making erroneous predictions about recommended treatment for someone in a hospital setting can be deadly.

Some great examples of the kinds of issues that can arise in the latter can be found in this paper.

For example, in the mid-90s a bunch of funding went into various efforts to reduce the cost of medical diagnosis.

In one particular example, the goal was to predict the probability of death for a patient with pneumonia so that low-risk cases could be treated as outpatients and the high-risk pool would be admitted.

Various models were built to solve this problem, and then the area under the curve (AUC), amongst other metrics, was measured.

Some of the most accurate methods turned out to be Neural Networks, followed by rule-based models, followed by good ol’ Logistic Regression.

It was not possible to comprehend the neural network, but one of the rules in the rule-based model looked suspicious:

Has Asthma (x) => Lower Risk (x)

This is counterintuitive.

Patients with pneumonia and a history of asthma are at high risk and are almost always admitted and usually treated aggressively in the ICU (I myself was recently given the vaccine for pneumonia because of my history with asthma and lung infections).

A bit of digging revealed that patients who had pneumonia and a history of asthma were admitted and treated as high risk, therefore receiving significantly better care.

Thus, their prognosis was usually much better and the model learnt that asthmatic patients with pneumonia had a low probability of death (go figure!).

This highlights a key problem: models can learn unintended artefacts in the data, which are not aligned with the problem, and the opacity of non-parametric models (I refuse to use the term black box!) makes it very difficult to ascertain what’s actually going on.
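To make this concrete, here is a minimal sketch of the kind of behavioural check that can surface a relationship like the asthma rule before deployment. Everything below is hypothetical (made-up features, simulated labels that bake in the care-induced artefact), and it assumes a recent scikit-learn for the inspection utilities; it is not the original study’s code.

```python
# Hypothetical sketch: fit a flexible model, then ask how predicted risk
# moves with a single feature (here, an "asthma" flag). Illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence, permutation_importance

rng = np.random.default_rng(0)
n = 5000
# Made-up features: age, history of asthma, systolic blood pressure.
X = np.column_stack([
    rng.normal(65, 10, n),     # age
    rng.integers(0, 2, n),     # history of asthma (0/1)
    rng.normal(120, 15, n),    # systolic blood pressure
])
# Simulated outcome encoding the artefact: asthmatics receive aggressive care,
# so their *observed* mortality is lower despite higher underlying risk.
logit = 0.04 * (X[:, 0] - 65) - 1.2 * X[:, 1] + 0.01 * (X[:, 2] - 120) - 2.0
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

model = GradientBoostingClassifier().fit(X, y)

# Which features drive the model's predictions overall?
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("permutation importances:", imp.importances_mean)

# How does predicted risk change with the asthma flag, other features averaged?
pd_result = partial_dependence(model, X, features=[1], kind="average")
print("partial dependence on asthma flag:", pd_result["average"])
```

On data simulated this way, the partial dependence on the asthma flag comes out lower for asthmatics, i.e. the same counterintuitive “asthma implies lower risk” signal, and that is exactly the kind of thing worth chasing down before the model makes predictions in the real world.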

In fact, one of the creators of LIME (local interpretable model-agnostic explanations), an increasingly popular algorithm that attempts to explain the predictions of any predictor, presents another great example of this with many parallels.

The prediction problem they emphasize is being able to discriminate between huskies and wolves from images of the animals.

They’re from U. Wash. and their mascot is a husky. They want to be able to let people bring huskies, but not wolves, to sports events (naturally, I feel like my pet wolf is being discriminated against here).

Which one is a wolf? Or, how to build a great snow detector! (source: pixabay.com)

To do this, they trained a deep learning classifier that performs the task very well and presented their results to technical folks to see if they thought the model was acceptable and could be used.

But alas, there is subterfuge involved: they are aware that the image classifier has learnt a neat trick. The images of wolves they used all contained snow, whilst those of huskies did not, and the classifier was picking up the background as a feature rather than features of the animals themselves.

They had built a great snow detector! This was done on purpose, to study the acceptance of accurate but flawed models by savvy users (graduate students, most likely PhD, with ML experience), and the result was that about half of them accepted the model (the other half pointed either directly or indirectly to the background or related features and were sceptical).

While the study was small, it does well to illustrate the kinds of problems that can occur without qualitative information available on how the model is working.
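For a flavour of what such qualitative information looks like in practice, here is a minimal sketch of asking LIME’s image explainer which superpixels a classifier is actually leaning on. It assumes the lime package’s image API; the “model” below is a stand-in stub (not their husky/wolf network) that keys on image brightness, mimicking a snow detector.

```python
# Hypothetical sketch of explaining an image classifier with LIME.
import numpy as np
from lime import lime_image

def predict_fn(images):
    """Stand-in for a trained husky-vs-wolf CNN: takes a batch of HxWx3 images
    in [0, 1] and returns probabilities for [husky, wolf]. Hypothetical."""
    brightness = images.reshape(len(images), -1).mean(axis=1)
    return np.column_stack([1.0 - brightness, brightness])

# A made-up image; in practice this would be a real photo.
image = np.random.default_rng(0).random((128, 128, 3))

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, predict_fn, top_labels=2, hide_color=0, num_samples=200
)

# Superpixels that pushed the prediction towards the top class.
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=True
)
```

If the highlighted regions turn out to be snowy background rather than the animal, you have built a great snow detector, and it is far better to find that out from the explanation than from the sceptical half of your audience.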

In the case of the medical diagnosis for pneumonia patients, it was ultimately decided that Logistic Regression was the better model, as there was too much risk associated with the more accurate Neural Network models.

For Accountability & Fairness

The criminal justice system is fair.

A provocative, hornet’s-nest-stirring statement, I know.

But we are talking about Machine Learning and I am merely exploiting a charged political environment.

I was alarmed by a relatively recent study published by ProPublica regarding an algorithm used nationwide to generate a score for recidivism (the tendency of a convicted criminal to re-offend).

Judges, probation and parole officers are increasingly using these scores in pretrial and sentencing, the so-called “front-end” of the criminal justice system.

The study used data from a county in Florida and found that the scores were significantly biased against African Americans (surprise?).

While there are certainly caveats to any study, and there is wisdom in doing an in-depth review and coming to your own conclusions, a couple of key statistics that the study found came from the confusion matrix when predicting repeat offenders.

It was found that the false positive rate for African American defendants was 45%, vs. 23% for white defendants. In other words, nearly half of the African Americans classified as likely to recidivate (commit a crime again) actually wouldn’t, vs. about a quarter of white defendants who were falsely classified as likely to recidivate. Both numbers are high, and the score is used by officers and judges as part of a “decision support system”, but the stark difference of nearly 2x in the error rate is alarming.

A second metric, the false negative rate (those defendants that the algorithm identified as unlikely to commit a crime again but who ultimately did), is significantly higher for white defendants (48%) than for African Americans (28%), i.e. white defendants have a significantly higher chance of getting the benefit of the doubt.
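These rates are nothing exotic; they fall straight out of per-group confusion matrices. A minimal sketch, with made-up labels and groups rather than the actual COMPAS data:

```python
# Per-group false positive and false negative rates (illustrative data only).
import numpy as np

def group_error_rates(y_true, y_pred, group):
    """Return {group: {"FPR": ..., "FNR": ...}} from binary labels/predictions."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        fp = np.sum((yp == 1) & (yt == 0))
        tn = np.sum((yp == 0) & (yt == 0))
        fn = np.sum((yp == 0) & (yt == 1))
        tp = np.sum((yp == 1) & (yt == 1))
        rates[g] = {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}
    return rates

# Toy example: y_true = actually re-offended, y_pred = scored as high risk.
rng = np.random.default_rng(0)
group = rng.choice(["A", "B"], size=1000)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)
print(group_error_rates(y_true, y_pred, group))
```

A gap like 45% vs. 23% in the FPR column of such a table is precisely the kind of disparity that deserves scrutiny before a score is handed to a judge.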

There are arguments about fairness on the other side as well, and the company that created the algorithm published a rebuttal that can be found here, which has in turn been re-rebutted by ProPublica here.

(source: propublica.org)

Still, these numbers are troubling and raise huge concerns about the ethical nature of such tools and the amount of consideration and study such models should be given.

This is a hard problem, not just because it is technically hard, but also because of what is at stake.

How much trouble did the creators of the model go to, to drill through the data, analyze various angles, simulate, and essentially document the analysis? Keep in mind that “race” is NOT a variable used in the algorithm (that would be illegal), and sadly the sentencing commission is currently not conducting an analysis of bias in risk assessments.

Let’s be clear, latent demographic information is usually represented in the data in other features and much care is needed to test for these biases.

In these cases, it is all too easy to abide by the letter of the law but not its spirit.

I’ll ask the reader this: is it fair to use where someone went to school to decide whether to furnish them a loan? It’s not a trick question, it’s just tricky.

For Better Science

It was almost a decade ago now that my previous startup, Skytree, was doing a proof of concept at one of the biggest credit card companies in New York, which involved a binary classification problem with a somewhat large but highly imbalanced dataset (the positive-to-negative class ratio was 1:1000).

We were using k-nearest-neighbors (yes, I know what some of you are thinking) and finding neighbors in a stratified manner, i.e. separately for each class, to deal with the imbalance.

The solution had high AUC but a large error in another metric the data scientists at the client tracked.

The stratified search had warped the scale of the output probabilities, and this would not be a workable solution for applications where the scale mattered, such as risk (e.g. credit scoring, loan defaults).

(source: pixabay.com)

Our engineers came up with a clever hack (we nicknamed it the “caveman” solution) to post-process the probabilities, which worked! But the professor on the team was not satisfied (neither were we, but…) and decided to mathematically derive a “proper” solution.

And lo and behold, luckily for us, the two solutions were equivalent.
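For illustration only, here is the standard prior-shift correction for probabilities produced by a model trained (or scored) under an artificially rebalanced class distribution. This is a sketch of the general technique of mapping scores back to the true base rate, not necessarily the formula involved in either of our solutions.

```python
# Illustrative only: map probabilities estimated under a rebalanced class
# distribution back to the true class priors (Bayes rule with adjusted priors).
import numpy as np

def correct_for_resampling(p_resampled, train_pos_rate, true_pos_rate):
    """Adjust P(positive | x) from the resampled prior to the true prior."""
    p = np.asarray(p_resampled, dtype=float)
    w_pos = true_pos_rate / train_pos_rate                # re-weight positive prior
    w_neg = (1 - true_pos_rate) / (1 - train_pos_rate)    # re-weight negative prior
    return (p * w_pos) / (p * w_pos + (1 - p) * w_neg)

# Example: scores from a model built on a 50/50 resample of data whose true
# positive rate is roughly 1 in 1001 (the 1:1000 imbalance mentioned above).
p_hat = np.array([0.2, 0.5, 0.9])
print(correct_for_resampling(p_hat, train_pos_rate=0.5, true_pos_rate=1 / 1001))
```

Note how a score of 0.5 under the balanced view maps back to roughly the 1-in-1000 base rate, which is what a well-calibrated risk application needs.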

The moral of the story here was that the data scientists at the client studied deeper metrics, which the algorithm designers had not considered, to understand the model, and it is thanks to this data science that an important breakthrough was designed and implemented by the mathematicians.

The net-net was better science and better models, a win-win so to speak (a rarity in life, etc.).

Closing Thoughts: Build Trust!

Today, many more tools and techniques are available to better understand non-parametric models and should be utilized effectively.

It is also essential that model builders do not view models in isolation but as an important part of a broader system.

No longer should one say “I am just an engineer” when one’s model has the potential to affect the lives of people in serious ways.

Ultimately, for our science to be successful and the proliferation of ML models a reality, it is critical that predictive systems are accountable and that the decision path is clear, with room for recourse. As Machine Learning people, we need to support the science of building models with the science of understanding them and their impact in real-world settings.

So I say to you again: It is necessary to understand your model.

Predict well!
