Deployed Your Machine Learning Model? Here’s What You Need to Know About Post-Production Monitoring

The next best thing to do is to continuously track the health of the machine learning model against a set of key indicators and generate specific event-based alerts.

The obvious follow-up questions are: what are these key indicators, and which events should trigger an alert?

These questions are addressed by the proactive model monitoring framework.

The key element of the monitoring framework is to identify which input samples deviate significantly from the patterns seen in the training data and then have those samples closely examined by a human expert.

Unfortunately, there is no universal way of identifying which patterns are most relevant.

Patterns of interest largely depend on the domain of the data, the nature of the business problem, and the machine learning model being used.

For example, in the Natural Language Processing (NLP) domain, some of the simple patterns could be:

- Identify all the data samples that have at least one word not seen in the training data.
- Identify all the data samples that are at least N words longer or M words shorter than the average number of words in the training data.

A slightly more advanced pattern could be based on the machine learning technique used.
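Before turning to that more advanced pattern, the two simple checks above can be sketched in a few lines of Python. This is only an illustration: whitespace tokenization, the function names, and the default values for N and M are assumptions made for the example, not part of any particular framework.

```python
def build_reference(training_texts):
    """Collect the vocabulary and the average length (in words) of the training data."""
    tokenized = [text.lower().split() for text in training_texts]
    vocabulary = {word for tokens in tokenized for word in tokens}
    average_length = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    return vocabulary, average_length


def flag_sample(text, vocabulary, average_length, n_longer=10, m_shorter=5):
    """Return the reasons (possibly none) why a sample deviates from the training data."""
    tokens = text.lower().split()
    reasons = []
    # Pattern 1: at least one word that never appeared in the training data.
    unseen = [word for word in tokens if word not in vocabulary]
    if unseen:
        reasons.append("unseen words: " + ", ".join(unseen))
    # Pattern 2: at least N words longer or M words shorter than the average.
    if len(tokens) >= average_length + n_longer:
        reasons.append("too long")
    if len(tokens) <= average_length - m_shorter:
        reasons.append("too short")
    return reasons


# Toy training set, made up for illustration.
training_texts = [
    "book a flight to paris",
    "cancel my flight booking",
    "what is the weather in paris today",
]
vocabulary, average_length = build_reference(training_texts)

print(flag_sample("book a flight to tokyo", vocabulary, average_length, 3, 3))
print(flag_sample("cancel", vocabulary, average_length, 3, 3))
```

Any sample that comes back with a non-empty list of reasons would be queued for review by a human expert.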

For example, again in the NLP domain, assume that a distributed representation was used to place every word in an L-dimensional space.

We can quantify the word distribution in the training data using modeling techniques like Gaussian Mixture Models (GMM).

Now, given a test data sample, find the probability of the sample given the GMM.

All data samples with a probability lower than a certain threshold can be marked as ‘non-representative’ (i.e., anomalous) and sent to the domain experts for further investigation.
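The GMM-based check described above can be sketched as follows. The sketch assumes the mixture parameters (weights, means, variances) were already fitted on the training-data word vectors; fitting is not shown. The diagonal covariance, the averaging of per-word log-likelihoods into one score per sample, the toy parameters, and the threshold value are all assumptions made for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_k w_k * N(x | mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def is_non_representative(word_vectors, gmm, threshold):
    """Flag a sample whose average per-word log-likelihood falls below the threshold."""
    scores = [gmm_log_likelihood(vec, *gmm) for vec in word_vectors]
    return sum(scores) / len(scores) < threshold


# Hypothetical 2-component GMM in a 2-dimensional embedding space (L = 2).
GMM = (
    [0.5, 0.5],                # mixture weights
    [[0.0, 0.0], [5.0, 5.0]],  # component means
    [[1.0, 1.0], [1.0, 1.0]],  # per-dimension variances
)
THRESHOLD = -6.0  # illustrative value; in practice tuned on held-out data

typical_sample = [[0.1, -0.2], [4.8, 5.1]]  # word vectors close to the components
unusual_sample = [[20.0, -20.0]]            # a word vector far from both components

print(is_non_representative(typical_sample, GMM, THRESHOLD))
print(is_non_representative(unusual_sample, GMM, THRESHOLD))
```

Working in log space avoids numerical underflow, which matters because densities in a high-dimensional embedding space are extremely small.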

Even more sophisticated patterns for identifying test samples of interest can be devised based on the knowledge of the business problem, the specifics of the data, or the specifics of the machine learning machinery used.

For instance, any machine learning solution can be thought of as a combination of multiple elemental ML components.

As an example, a machine learning model for intent mining in a conversational agent may consist of three ML modules:

- A module for audio analysis of the raw speech input to identify the sentence type (i.e., statement, question, exclamation or command),
- A module for text analysis of the transcribed speech input to identify the semantic message, and
- A module that combines the output of the other two modules to identify the intent.

During the training phase, we can identify the relative proportion of the paths traversed by different training samples through these three modules and the corresponding predicted outputs.

During the model monitoring phase, we can identify the samples that led to a particular output via a path through the three modules that was not among the paths observed for that output during the training phase.

Note that to achieve this level of pattern-based model monitoring, the end-to-end solution needs to have a robust logging mechanism.
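The path-based check for the three-module example can be sketched as below. It assumes every logged record holds the intermediate outputs of the audio and text modules (the "path") together with the final predicted intent. The record layout, the label names, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def learn_paths(training_records):
    """Count, for each predicted output, how often each module path led to it."""
    path_counts = defaultdict(Counter)
    for path, output in training_records:
        path_counts[output][tuple(path)] += 1
    return path_counts


def is_unseen_path(path, output, path_counts):
    """True if this path never produced this output during training."""
    # Counter returns 0 for a missing key without inserting it.
    return path_counts.get(output, Counter())[tuple(path)] == 0


# Each record: ((sentence_type, semantic_message), predicted_intent)
training_records = [
    (("question", "weather"), "get_forecast"),
    (("question", "weather"), "get_forecast"),
    (("command", "weather"), "get_forecast"),
    (("command", "music"), "play_music"),
]
path_counts = learn_paths(training_records)

# Same intent as in training, but reached through a path never seen for it.
print(is_unseen_path(("exclamation", "weather"), "get_forecast", path_counts))
print(is_unseen_path(("question", "weather"), "get_forecast", path_counts))
```

Because the counts are kept rather than just the set of paths, the same structure also gives the relative proportion of each path per output, which can be compared against production traffic.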

Reactive Model Monitoring

After the successful deployment of a machine learning-driven solution, the data science team will almost always feel like they have earned bragging rights like “our system has state-of-the-art 99% accuracy!”

But instinctively (and rightfully so), the first thing that the customer-facing teams will ask is “what is the plan to address customer escalations on the 1%?”

This calls for reactive model monitoring which performs root-cause-analysis (RCA) of the customer escalations and provides an estimate of when the bugs will be fixed.

Reactive model monitoring is quite similar to proactive model monitoring.

But there are subtle differences in terms of the end goals.

Whereas proactive model maintenance identifies general patterns in the test data that are outliers compared to those in the training data, the goal of reactive model maintenance is to identify what led to an erroneous output for a specific test sample and how it can be rectified.

The data science team thus needs to be cautious when accepting the rectifications suggested by the reactive model maintenance process as those recommendations can possibly be detrimental to a wide range of data samples.

Another challenging aspect of reactive model maintenance is that some bugs can be resolved by a simple change in one of the config files, while others may need elaborate retraining of the ML model.

Also, some bugs may be within the tolerance threshold of a typical user, while others may be what I call ‘publicity-hungry’ bugs.

A ‘publicity-hungry’ bug is any incorrect behavior of the machine learning system that would be totally unexpected from a human expert.

For instance, if an ML-powered conversational agent responds to the user’s statement “I am tired” with “Hello Mr. Tired, how are you?”, then that is sure to get a lot of tweets and retweets and similar publicity!

Such publicity-hungry bugs need immediate resolution.

The Service Level Agreements (SLAs) will thus need to be carefully crafted keeping in mind the severity of the bug on one hand and the systemic changes needed on the other hand.

Address the Root Cause, Not the Symptoms

Given the wide variety of sources that can lead to a drop in the performance of ML systems over time, and the intense pressure to fix issues within a given SLA, it can be tempting to have a ‘thin layer of rules’ which bypasses the ML machinery completely to address the immediate customer escalation.

Such a thin-layer or hot-fix approach is actually a ‘lazy-fix’ which has the potential to turn disastrous in the long run.

Thus, such a thin layer of rules should be used only under extreme conditions and should not be allowed to grow beyond a certain ‘thickness’.

When the pre-defined ‘thickness’ is reached, our machine learning model has to be retrained to address the issues encoded in the thin-layer.
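One way to enforce such a cap is to make the rule layer itself refuse new rules once the limit is reached. The sketch below is a hypothetical illustration: the `RuleLayer` class, the exact-match lookup, and the stand-in model are all assumptions, and a real system would match rules far less literally.

```python
class RuleLayer:
    """Hot-fix overrides consulted before the model, capped at a fixed 'thickness'."""

    def __init__(self, model, max_rules):
        self.model = model          # any callable mapping an input to an output
        self.max_rules = max_rules  # the pre-defined 'thickness'
        self.rules = {}

    def add_rule(self, user_input, forced_output):
        """Register an override, refusing to grow past the agreed limit."""
        if user_input not in self.rules and len(self.rules) >= self.max_rules:
            raise RuntimeError("rule layer is full: retrain the model instead")
        self.rules[user_input] = forced_output

    def predict(self, user_input):
        """Use the override if one exists, otherwise fall back to the model."""
        if user_input in self.rules:
            return self.rules[user_input]
        return self.model(user_input)

    def retraining_examples(self):
        """The accumulated rules double as labelled examples for retraining."""
        return list(self.rules.items())


def toy_model(text):
    """Stand-in for the deployed model: it treats everything as a greeting."""
    return "greeting"


layer = RuleLayer(toy_model, max_rules=2)
layer.add_rule("i am tired", "offer_break")

print(layer.predict("i am tired"))  # handled by the hot-fix rule
print(layer.predict("hello"))       # handled by the model
```

Raising an error when the cap is hit turns the "retrain now" policy into something the team cannot silently ignore, and `retraining_examples` hands the encoded fixes back to the training pipeline.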

To borrow an analogy from the medical domain: addressing symptoms may not need an expert, but if that is routinely substituted for a thorough diagnosis, the situation can deteriorate quite rapidly.

Just as an accurate medical diagnosis comes from analysis of the patient’s history, proactive model maintenance has to be broad enough to quickly help identify the root cause of a customer escalation.

Retraining a machine learning model that is already deployed in a live production environment is much easier said than done.

For one, there are multiple ways to solve a particular data-driven problem, and as we see more data our choice of the model may change.

Secondly, the data science team that built the original model and the team that is maintaining the model may not readily agree on the best way to retrain the model.

Moreover, the team that built the original model may have tried out a wide variety of training strategies/modeling techniques before settling on one.

This information is typically not documented, and hence model retraining may very well lead to a net drop in accuracy.

To add to the mix, the end client may often prefer receiving consistent output over a now-correct-but-earlier-incorrect output.

Here is what I mean by this: Say your original speech recognition system would confuse “Tim” with “Jim” about 80% of the time.

The end client estimated this frequency of error and has included mechanisms in their downstream processing to try both ‘Tim’ and ‘Jim’ with an 80-20 proportion.

Suddenly, when the retrained speech recognition system reduces the Tim/Jim confusion to only 10%, the end customer may not readily agree to make the necessary (potentially non-trivial) changes on their end.

The business teams and the customer-facing teams may, in such cases, decide that certain customers will continue to get the old speech recognition system while the other customers will be migrated to the newer one.

This means the data science teams will now have to maintain two models!

This opens up a whole new area of discussion called ‘technical debt of machine learning models’.

Consistency can trump accuracy.

Turns out “Be Consistent!” is just as great a motivating phrase for ML models as it is for humans!

It is an area I would love to discuss more, but not in this series.

What’s in a Name!

“What’s in a name?” – William Shakespeare

Finally, the general perception is that the phrases ‘model maintenance’ and ‘model monitoring’ sound ‘uncool’ compared to ‘model building’.

In contrast, what I have seen is that the level of data science maturity, depth of big data engineering, and business understanding needed in ‘model maintenance’ is an order of magnitude greater than what is needed in ‘model building’.

I am always tempted to rebrand ‘model maintenance’ as ‘model nurturing’, particularly in light of the critical role that maintenance and monitoring play in ensuring customer delight.

End Notes

If you are in the tech industry, there is no escaping the buzz around Artificial Intelligence, Machine Learning, Data Science and related keywords.

I genuinely believe that all this focus on data-driven technologies will help bring in substantial efficiency in existing processes and help conquer new tech frontiers which have long been elusive.

However, the general expectations from these technologies are dangerously unrealistic, largely fed by the popular imagination of sci-fi literature.

Part of it is also affirmed by what we see in some of the low-stakes consumer-AI applications.

When executive decision-makers set such expectations for their data science groups, they inadvertently ignore two important factors:

- Data science cannot generate impact in isolation; the entire organization has to be trained into a ‘data culture’, which of course is easier said than done, and
- Years of concerted effort by data experts have gone into building the consumer-AI applications that are gaining popularity in the media of late.

This mismatch between expectations and reality is driving us closer to what is termed the ‘AI Winter’.

I am certain that data-driven technologies are the best solution to most of the problems that the tech world faces today.

But, in the same breath, for these technologies to succeed, we need a holistic approach with the right expectations.

Through this four-article series, I am hoping to share my learnings of bridging the gap between a ‘prototype of a data-driven solution’ and an actual ‘data-driven solution deployed in the real-world with stringent SLAs’.

I hope you will find these learnings valuable as you continue your journey of data-driven transformation.

I would absolutely love to hear your thoughts on this.

Please do share your comments below or reach out at [email protected] / [email protected]