Trust and interpretability in machine learning

A model should be considered to be interpretable if it can be derived (or at least motivated) from a trustworthy theory.

This definition of interpretability serves the dual purpose of understanding and trust.

It helps us understand the model because we tend to understand things in a deductive manner — by going from the known to the unknown.

Also, with such a definition, the trust in the model is derived from the trust that we place in the underlying theory.

Indeed, there are situations where both understanding and trust are necessary — scenarios where we are interested in determining the causal factors behind the behavior of a system.

In such scenarios, we must insist that the corresponding models be interpretable according to the above definition.

Most models in the realm of physical sciences belong to this category.

One can argue that purely inductive blackbox models are not suitable for such scenarios.

However, there are many other situations where understanding might be nice to have, but by no means is it a must have.

In these situations what really matters is the ability to make trustworthy predictions.

In these situations, if we could provide an alternate source of trust, then our models need not be bound by the definition of interpretability given above.

This is a common argument, and there is merit to it.

Remember, machine learning is a way of systematically building models from (preferably) large amounts of data using inductive reasoning.

Constraining these models to be interpretable in a deductive manner can seriously limit their accuracy.

The question then becomes: how can we generate trust in a blackbox model when we have little to no insight into its inner workings?

A credible basis for trust could be testing.

After all, testing forms the basis of our trust in regular software.

But to test a model, we need to be able to formalize our expectations about it.

If we could formalize our expectations completely then that would correspond to a complete specification of the model itself.

In that case, we would not really need machine learning or any other modeling methodology.

What we really need to be able to do is to formalize our expectations about the aspects of the model that we consider important.

This is not easy either, because many of the concepts that we care about, such as fairness, do not lend themselves to a convenient mathematical treatment.

It is worth pointing out that significant progress has been made in developing methodologies for testing machine learning models.

I personally find the idea of using metamorphic relations for formalizing expectations to be particularly promising.
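
As a rough illustration of what a metamorphic relation looks like in practice, here is a minimal sketch. The model `predict_default_risk` is a hypothetical stand-in for a blackbox, and the relation itself (raising income while holding debt fixed should not raise the predicted risk) is an assumed expectation chosen for the example, not something prescribed by any particular testing framework.

```python
import random

def predict_default_risk(income, debt):
    """Hypothetical stand-in for a blackbox model; returns a risk score in [0, 1)."""
    ratio = debt / (income + 1.0)
    return ratio / (1.0 + ratio)

def check_monotonic_in_income(model, n_trials=1000, seed=0):
    """Metamorphic relation (an assumed expectation): with debt held fixed,
    a higher income must not produce a higher risk score.
    Returns the list of inputs that violate the relation."""
    rng = random.Random(seed)
    violations = []
    for _ in range(n_trials):
        income = rng.uniform(10_000, 200_000)
        debt = rng.uniform(0, 500_000)
        income_increase = rng.uniform(1, 50_000)
        base = model(income, debt)
        followup = model(income + income_increase, debt)
        if followup > base + 1e-12:  # small tolerance for floating point noise
            violations.append((income, debt, income_increase))
    return violations

violations = check_monotonic_in_income(predict_default_risk)
print(f"violations found: {len(violations)}")
```

Note that the check never needs the correct risk score for any applicant. It only compares the model against itself on related inputs, which is what makes this style of test usable when a full specification is out of reach.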

But, we are still a long way from having concrete methodologies that will allow us to perform comprehensive testing of blackbox models, and this inability of ours contributes to a trust deficit in blackbox models.

One could question the efficacy of such expectation-based comprehensive testing.

After all, the goal of machine learning is to find undiscovered patterns in data.

Insisting that the models meet our expectations amounts to pre-defining the model, which defeats the whole purpose.

Following this line of reasoning, one would argue that as long as the data is representative and our algorithms are powerful enough to capture the patterns, there is little reason not to trust the model. We should expect the model results to generalize to the overall population, and the extent to which we should expect them to generalize is encapsulated in the model’s performance (accuracy) scores.

Thus, in essence we are asked to delegate our trust to the trifecta of data, algorithms and performance scores.

We first need to disabuse ourselves of the notion that a single performance (accuracy) score can form a sufficient basis for trusting the model.

A performance score is usually a point estimate of how a model is expected to generalize on average over a population, given the current data.

Trust, on the other hand, is a nuanced, multidimensional concept that cannot be encapsulated in such a single coarse-grained score.

One can imagine defining more granular performance scores, e.g., by population segments.

But, that would require a certain level of understanding of the population and determining what we consider important — this is not very different from forming expectations.
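
To see how a single score can hide segment-level behavior, consider the following minimal sketch. The records, the segment names and the numbers are all made up for illustration; the only point is that the overall accuracy and the per-segment accuracies can tell very different stories.

```python
from collections import defaultdict

def accuracy(pairs):
    """Fraction of (y_true, y_pred) pairs that agree."""
    return sum(1 for y_true, y_pred in pairs if y_true == y_pred) / len(pairs)

def accuracy_by_segment(records):
    """records: iterable of (segment, y_true, y_pred) tuples."""
    groups = defaultdict(list)
    for segment, y_true, y_pred in records:
        groups[segment].append((y_true, y_pred))
    return {segment: accuracy(pairs) for segment, pairs in groups.items()}

# Made-up evaluation records: 90 from a majority segment, 10 from a minority one.
records = (
    [("majority", 1, 1)] * 88 + [("majority", 1, 0)] * 2
    + [("minority", 1, 1)] * 4 + [("minority", 1, 0)] * 6
)

overall = accuracy([(y_true, y_pred) for _, y_true, y_pred in records])
per_segment = accuracy_by_segment(records)
print(f"overall: {overall:.2f}")
for segment, score in per_segment.items():
    print(f"{segment}: {score:.2f}")
```

Choosing which segments to report is itself a modeling decision: the code has no way of knowing that the minority segment matters unless someone tells it so.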

Let us examine the data aspect of this argument.

It is, indeed, quite easy to convince oneself that if the data is representative of the population we are interested in, then it should contain all the relevant patterns and no spurious ones.

Unfortunately, that is rarely the case.

The degree to which the data can be non-representative depends quite acutely on the situation.

Nonetheless, we can identify certain high level scenarios.

In the first scenario, we would have a good understanding of the population and complete control over the data collection mechanism.

In this scenario, we can choose our data to be representative, and with a high degree of confidence we can expect our resulting model’s predictions to be applicable to the overall population.

However, note that having a good enough understanding of the population to be able to draw a representative sample for the task at hand means that we already have some understanding of which features are important for the prediction.

Hence, in this case it is debatable if blackbox models are terribly useful.

Opinion polling for predicting election results is a good example of this scenario.

In the second scenario, we do not have complete control over the data collection, but our predictions do not affect the data collected.

In this scenario, if we assume that the data collection mechanism is unbiased then were we to wait long enough, we would have a representative sample of the population.

Of course, there are a lot of ifs and buts that go with this assumption.

Firstly, one does not know how long is long enough.

Thus one needs to assume that the time scale over which the data is collected is long enough to produce a representative sample.

Furthermore, the population itself might change in the meantime.

Thus, an additional assumption is that the time scale over which population changes is much longer than the time scale over which a representative sample is generated.

As long as we can justify those assumptions, the estimated performance will be reliable.

A model for predicting stock prices is an example of such a scenario: as long as we are not making investments that are large enough to tip the market as a whole, the decisions that we make as a result of the predictions should not affect the stock prices.

The third scenario is one where the data collection is impacted by the predictions, but we have a moderate to high risk appetite for wrong predictions.

An example of this is a product recommender system.

The model for a recommender system will be trained on data consisting of ordered lists of products that different users have bought or clicked on.

Based on this data, the model will predict what a user is most likely to buy or click on, and based on the model’s predictions the system will decide what the user gets to see, which limits what they can buy or click on.

Thus the prediction biases the data collection.

In product recommender systems, one can partly circumvent this problem by keeping an exploration budget: for a fraction of the cases, the system shows the user a random set of products regardless of the prediction of the model.

The observations resulting from these randomized predictions can then be used to estimate the performance of the model.
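
A minimal sketch of such an exploration budget is shown below. Everything in it is an assumption made for illustration: the three-item catalog, the made-up click rates, the fixed model ranking and the 10% budget (`epsilon`). It shows one item per impression rather than a set, to keep the example short.

```python
import random

def choose_item(ranked_items, catalog, epsilon, rng):
    """With probability epsilon show a uniformly random item (exploration);
    otherwise show the model's top-ranked item. Returns (item, explored)."""
    if rng.random() < epsilon:
        return rng.choice(catalog), True
    return ranked_items[0], False

catalog = ["a", "b", "c"]
# Made-up click probabilities; in reality these are unknown to the system.
true_click_rate = {"a": 0.10, "b": 0.30, "c": 0.05}
model_ranking = ["a", "b", "c"]  # hypothetical model that always prefers "a"

rng = random.Random(0)
shown = {item: 0 for item in catalog}    # exploration impressions only
clicks = {item: 0 for item in catalog}   # clicks on exploration impressions
exploited_items = set()                  # items seen via the model's choice

for _ in range(20_000):
    item, explored = choose_item(model_ranking, catalog, 0.1, rng)
    clicked = rng.random() < true_click_rate[item]
    if explored:
        shown[item] += 1
        clicks[item] += int(clicked)
    else:
        exploited_items.add(item)

# Click rates estimated from the randomized impressions alone.
estimates = {item: clicks[item] / shown[item] for item in catalog}
print("items ever shown by the model:", sorted(exploited_items))
print("estimates from exploration:", estimates)
```

Without the exploration impressions, the logs would contain feedback only for the item the model already favors, and there would be no data from which to notice that another item performs better.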

One still has to address the concerns of the aforementioned second scenario in order to assess the reliability of these estimates.

In the fourth and final scenario, the data collection is impacted by the predictions, but we have little to no risk appetite for wrong predictions.

For example, suppose we have to build a model to predict whether someone will default on their mortgage loan payments.

The mortgage loan will be approved or not based on the prediction.

If the prediction is that the person will default, then the loan will not be approved, and in that case there is no way of knowing whether this person would have actually defaulted or not.

It is difficult to imagine a situation where an institution would randomly approve (or otherwise) a loan for the sake of data exploration.

In these situations, it is very difficult to gauge the reliability of the estimated performance of the resulting model without additional information.
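
The effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the uniformly distributed risk, the noisy score standing in for a model, and the approval threshold are all hypothetical. The full set of outcomes is visible here only because the data is simulated; a real lender sees outcomes for approved applicants alone.

```python
import random

rng = random.Random(0)
THRESHOLD = 0.5  # hypothetical approval cutoff on the model's risk score

would_default = []  # outcome for every applicant (knowable only in a simulation)
observed = []       # outcomes the lender actually gets to see

for _ in range(10_000):
    risk = rng.random()                 # assumed true default probability
    defaults = rng.random() < risk      # what would happen if the loan were granted
    score = risk + rng.gauss(0.0, 0.1)  # noisy score from a hypothetical model
    would_default.append(defaults)
    if score <= THRESHOLD:              # loan approved, so the outcome is observed
        observed.append(defaults)

population_rate = sum(would_default) / len(would_default)
observed_rate = sum(observed) / len(observed)
n_unlabeled = len(would_default) - len(observed)

print(f"default rate in the full population: {population_rate:.2f}")
print(f"default rate among approved loans:   {observed_rate:.2f}")
print(f"applicants with no observed outcome: {n_unlabeled}")
```

Any performance score computed from the observed outcomes describes only the applicants the model chose to approve, and says nothing about how many rejected applicants would have repaid.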

Thus, it is not such a great idea to blindly expect the data to be representative of the population.

In most scenarios, given the constraints of the problem at hand, it simply might not be possible to get an unbiased representative sample.

Understanding the limitations of one’s data collection mechanism, being able to deduce the implications of those limitations, and having the honesty to report those as a part of the model’s results goes a long way in building trust.

Let us now consider the algorithm aspect of the argument.

It is a widespread belief that the more flexible an algorithm is the better it is, because flexibility equips an algorithm to capture more complex patterns.

But if the history of successful applications of machine learning is anything to go by, then this belief would appear to be misplaced.

In computer vision, success came when we were able to encode the symmetries in pictures into models in the form of convolutional neural networks.

In natural language processing we are now able to build extremely accurate cross-purpose language models because we could encode our knowledge about languages, including structure and word context, into these models.

In recommender systems, most collaborative filtering algorithms, including matrix factorization methods, make strong assumptions about the affinity of a user towards an item.
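
To make the nature of such an assumption concrete, here is a minimal sketch of the modeling assumption commonly associated with matrix factorization: affinity is the dot product of a short user vector and a short item vector. The function names and the sizes used are hypothetical, and the sketch covers only the form of the prediction, not how the factors are learned.

```python
def predict_affinity(user_factors, item_factors):
    """Assumed form of the prediction: a dot product of k latent factors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

def parameter_counts(n_users, n_items, k):
    """Free parameters without and with the low-rank assumption."""
    unconstrained = n_users * n_items    # one free value per (user, item) pair
    factorized = (n_users + n_items) * k # k factors per user and per item
    return unconstrained, factorized

print(predict_affinity([1.0, 0.0], [0.5, 2.0]))
print(parameter_counts(n_users=1000, n_items=500, k=10))
```

The factorized form is far less flexible than a free table of affinities, and that restriction is precisely what lets the model generalize to user and item pairs it has never observed.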

Whether we would like to slap the label of interpretability on these models or not, it is an objective fact that we build better models when we understand the domain and the context in which the model needs to operate.

The best models do not come from the most flexible algorithms; they come from algorithms that are well constrained by domain knowledge and have just the right amount of flexibility to capture the relevant patterns in the data.

We have seen the word understanding being used quite a few times in the above discussion.

What we should have realized by now is that it is difficult to build trust without understanding.

In the end, it boils down to how one perceives machine learning.

Yes, machine learning is an incredibly powerful inductive modeling technique.

When combined with big data and big compute, it allows us to model systems and solve problems that were previously out of our reach.

But the entry of machine learning should not imply the exit of everything else, including common sense.

Machine learning is one element in the wider modeling family that includes deductive modeling as well as domain knowledge.

The better we understand and leverage the interconnections between these elements, the further we will go towards robust complex system modeling.

Trust is contextual and can have multiple sources, but eventually it flows from knowledge and integrity; specifically, from our trust in the knowledge and integrity of the individuals who are building the models.

Trust as well as adoption of models will come, in my opinion, only when the wider audience is convinced that the modelers have the knowledge to understand the limitations of their models (machine learning or otherwise), and the integrity to report them.

