Overview of the different approaches to putting Machine Learning (ML) models in production

Evaluating and debugging real-time prediction models is significantly more complex.

They also require a log collection mechanism that captures both the predictions and the features that yielded each score, for further evaluation.
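As a rough illustration of such a log collection mechanism, the sketch below writes one JSON line per prediction, keeping the features next to the score so that both can be evaluated later. The function names, the record fields and the in-memory buffer standing in for a real log pipeline are all assumptions made for this example.

```python
import io
import json
import time
import uuid


def log_prediction(stream, model_version, features, score):
    """Append one prediction record as a JSON line and return its id."""
    record = {
        "prediction_id": str(uuid.uuid4()),
        "logged_at": time.time(),
        "model_version": model_version,
        "features": features,  # the exact inputs the model saw
        "score": score,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record["prediction_id"]


def read_predictions(stream):
    """Read back every logged record, e.g. for offline evaluation."""
    stream.seek(0)
    return [json.loads(line) for line in stream if line.strip()]


# Usage: an in-memory buffer stands in for a log file or log pipeline.
log = io.StringIO()
log_prediction(log, "churn-v1", {"n_complaints": 2, "tenure_months": 14}, 0.37)
```

Logging the model version alongside each record makes it possible to tell a posteriori which model produced which score.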

Batch Prediction Integration

Batch predictions rely on two different sets of information: the predictive model, and the features that we feed to it.

In most batch prediction architectures, ETL is performed either to fetch pre-calculated features from a specific datastore (a feature store) or to perform some transformation across multiple datasets, in order to provide the input to the prediction model.

The prediction model then iterates over all the rows in the dataset and produces a score for each of them.

Figure: example flow of model serving for batch prediction

Once all the predictions have been computed, we can then “serve” the scores to the different systems that want to consume them.

This can be done in different ways depending on the use case. If we wanted to consume the score in a front-end application, we would most likely push the data to a “cache” or NoSQL database such as Redis, so that we can offer millisecond responses. For other use cases, such as the creation of an email journey, we might simply rely on a CSV export over SFTP or a data load into a more traditional RDBMS.
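The batch flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the logistic weights are made up rather than trained, a plain dict stands in for a key-value cache such as Redis, and an in-memory buffer stands in for the exported CSV file.

```python
import csv
import io
import math

# Hypothetical logistic-regression weights standing in for a trained model.
WEIGHTS = {"n_complaints": 0.8, "tenure_months": -0.05}
INTERCEPT = -1.0


def predict(features):
    """Return a churn probability for one row of features."""
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(rows):
    """Score every row; each row is a dict with a customer_id and features."""
    return {row["customer_id"]: predict(row) for row in rows}


def export_csv(scores, stream):
    """Write the scores as CSV, e.g. for an SFTP drop or an RDBMS load."""
    writer = csv.writer(stream)
    writer.writerow(["customer_id", "churn_score"])
    for customer_id, score in sorted(scores.items()):
        writer.writerow([customer_id, f"{score:.4f}"])


# Rows as they might come out of the ETL step (made-up values).
rows = [
    {"customer_id": "c1", "n_complaints": 0, "tenure_months": 24},
    {"customer_id": "c2", "n_complaints": 3, "tenure_months": 2},
]
scores = score_batch(rows)

# Serving option 1: a dict stands in for a key-value cache such as Redis.
cache = dict(scores)

# Serving option 2: a flat-file export.
out = io.StringIO()
export_csv(scores, out)
```

Keeping the scoring step separate from the two serving steps means the same batch of scores can feed both a low-latency lookup and a file-based export.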

Real-time Prediction Integration

Pushing models into production for real-time applications requires three base components: a customer/user profile, a set of triggers and predictive models.

Profile: The customer profile contains all the attributes related to the customer, as well as the different attributes (e.g. counters) necessary to make a given prediction.

This is required for customer-level predictions, in order to reduce the latency of pulling the information from multiple places and to simplify the integration of machine learning models in production.

In most cases, a similar type of data store would be needed to effectively fetch the data that powers the prediction model.

Triggers: Triggers are events that cause a process to be initiated. For churn, for instance, they can be a call to a customer service center, a check of the order history, etc.

Models: Models need to have been pre-trained and typically exported to one of the three formats previously mentioned (pickle, ONNX or PMML), so that they can easily be ported to production.

There are quite a few different approaches to putting models in production for scoring purposes:

In-database integration: A lot of database vendors have made a significant effort to tie advanced analytics use cases into the database, be it through direct integration of Python or R code or through the import of PMML models.

Pub/Sub model: The prediction model is essentially an application feeding off a data stream and performing certain operations, such as pulling customer profile information.

Web service: An API wrapper is set up around the model prediction and deployed as a web service. Depending on how the web service is set up, it may or may not pull the data needed to power the model.

In-app: It is also possible to deploy the model directly into a native or web application and have the model run on local or external data sources.

Database integrations

If the overall size of your database is fairly small (< 1M user profiles) and updates are only occasional, it can make sense to integrate some of the real-time update process directly within the database.

Postgres has an integration called PL/Python that allows Python code to be run as functions or stored procedures.

This implementation has access to all the libraries on the PYTHONPATH, and as such can use libraries such as pandas and scikit-learn to run some operations.

This can be coupled with Postgres’ trigger mechanism to re-run the model from within the database and update the churn score.

For instance, if a new entry is made to a complaint table, it would be valuable to have the model re-run in real time.

Sequence flow

The flow could be set up in the following way:

New Event: When a new row is inserted in the complaint table, an event trigger is generated.

Trigger: The trigger function updates the number of complaints made by this customer in the customer profile table and fetches the updated record for the customer.

Prediction Request: Based on that, it re-runs the churn model through PL/Python and retrieves the prediction.

Customer Profile Update: It can then update the customer profile with the new prediction.

Downstream flows can then be started upon checking whether the customer profile has been updated with a new churn prediction value.
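To make the flow above concrete without a Postgres installation, the sketch below reproduces it with SQLite and Python's built-in `sqlite3` module, which can also expose a Python function to SQL and call it from a trigger. This is a stand-in for PL/Python rather than PL/Python itself: the table and column names, the `predict_churn` function and its coefficients are all hypothetical.

```python
import math
import sqlite3


def predict_churn(n_complaints):
    """Hypothetical churn model: a one-feature logistic function."""
    return 1.0 / (1.0 + math.exp(-(-2.0 + 0.8 * n_complaints)))


def build_database():
    conn = sqlite3.connect(":memory:")
    # Expose the Python model to SQL, much as PL/Python would in Postgres.
    conn.create_function("predict_churn", 1, predict_churn)
    conn.executescript(
        """
        CREATE TABLE customer_profile (
            customer_id  INTEGER PRIMARY KEY,
            n_complaints INTEGER NOT NULL DEFAULT 0,
            churn_score  REAL
        );
        CREATE TABLE complaint (
            complaint_id INTEGER PRIMARY KEY,
            customer_id  INTEGER NOT NULL
        );
        -- On a new complaint: bump the counter, re-run the model
        -- and store the new score on the customer profile.
        CREATE TRIGGER complaint_inserted AFTER INSERT ON complaint
        BEGIN
            UPDATE customer_profile
               SET n_complaints = n_complaints + 1
             WHERE customer_id = NEW.customer_id;
            UPDATE customer_profile
               SET churn_score = predict_churn(n_complaints)
             WHERE customer_id = NEW.customer_id;
        END;
        """
    )
    return conn


conn = build_database()
conn.execute("INSERT INTO customer_profile (customer_id) VALUES (1)")
conn.execute("INSERT INTO complaint (customer_id) VALUES (1)")
n, score = conn.execute(
    "SELECT n_complaints, churn_score FROM customer_profile WHERE customer_id = 1"
).fetchone()
```

In Postgres the trigger body would typically be a PL/Python function that loads the exported model instead of calling a hard-coded formula, but the sequence of steps is the same.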

Technologies

Different databases are able to support the running of Python scripts. This is the case for Postgres, which has a native Python integration as previously mentioned, but also for MS SQL Server through its Machine Learning Services (in-database). Other databases, such as Teradata, are able to run R/Python scripts through an external script command, while Oracle supports PMML models through its data mining extension.

Pub/Sub

Implementing real-time prediction through a pub/sub model makes it possible to handle the load properly through throttling.

For engineers, it also means that they can just feed the event data through a single “logging” feed, to which different applications can subscribe.

An example of how this could be set up: the page view event is fired to a specific event topic, to which two applications subscribe, a page view counter and a prediction app.

Both of these applications filter out the events relevant to their purpose and consume the corresponding messages from the topic.

The page view counter app provides data to power a dashboard, while the prediction app updates the customer profile.

Sequence flow

Event messages are pushed to the pub/sub topic as they occur, and the prediction app polls the topic for new messages.

When the prediction app retrieves a new message, it requests and retrieves the customer profile, then uses the message and the profile information to make a prediction, which it ultimately pushes back to the customer profile for further use.

A slightly different flow can be set up where the data is first consumed by an “enrichment app” that adds the profile information to the message and then pushes it to a new topic, from which it is finally consumed by the prediction app and pushed onto the customer profile.
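The basic flow (without the enrichment app) can be sketched with in-memory stand-ins. Here a `queue.Queue` plays the role of the topic and a dict plays the role of the profile store; a real broker would deliver each message to every subscriber, which a single queue does not model. The event fields and the model coefficients are hypothetical.

```python
import math
import queue

# In-memory stand-ins: a queue for the topic, a dict for the profile store.
topic = queue.Queue()
profiles = {"c1": {"n_complaints": 1, "churn_score": None}}


def predict_churn(profile, event):
    """Hypothetical model combining profile and event information."""
    z = -2.0 + 0.8 * profile["n_complaints"]
    if event["page"] == "/cancel-subscription":
        z += 1.5
    return 1.0 / (1.0 + math.exp(-z))


def run_prediction_app():
    """Drain the topic, handling only the events this app cares about."""
    handled = 0
    while True:
        try:
            event = topic.get_nowait()  # poll for the next message
        except queue.Empty:
            return handled
        if event["type"] != "page_view":
            continue  # filter out events that are irrelevant here
        profile = profiles[event["customer_id"]]  # fetch the profile
        # Predict from message plus profile, then push the score back.
        profile["churn_score"] = predict_churn(profile, event)
        handled += 1


topic.put({"type": "page_view", "customer_id": "c1", "page": "/home"})
topic.put({"type": "purchase", "customer_id": "c1"})
topic.put({"type": "page_view", "customer_id": "c1", "page": "/cancel-subscription"})
handled = run_prediction_app()
```

With a real broker, the polling loop would usually block on the consumer client instead of returning when the queue is empty.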

Technologies

The typical open-source combination supporting this kind of use case in the data ecosystem is Kafka with Spark Streaming, but different setups are possible in the cloud.

On Google Cloud, notably, Pub/Sub with Dataflow (Beam) provides a good alternative to that combination. On Azure, a combination of Azure Service Bus or Event Hubs with Azure Functions can serve as a good way to consume the messages and generate these predictions.

Web Service

We can put models into production as web services.

Implementing prediction models as web services is particularly useful when engineering teams are fragmented and need to handle multiple different interfaces such as web, desktop and mobile.

Interfacing with the web service can be set up in different ways:

Either by providing an identifier and having the web service pull the required information, compute the prediction and return its value,

Or by accepting a payload, converting it to a data frame, making the prediction and returning its value.

The second approach is usually recommended when there is a lot of interaction happening and a local cache is used to buffer the synchronization with the backend systems, or when predictions need to be made at a different grain than a customer id, for instance for session-based predictions.

Systems making use of local storage tend to have a reducer function, whose role is to calculate what the customer profile would be if the events in local storage were integrated back.

As such, it provides an approximation of the customer profile based on local data.
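The two ways of interfacing described above can be sketched as a single framework-agnostic request handler. This is a simplified illustration: the profile store, the model coefficients and the request fields (`customer_id`, `features`) are hypothetical, and a real service would wrap the handler in a web framework or a cloud function.

```python
import json
import math

# Hypothetical profile store that the service can query by identifier.
PROFILE_STORE = {"c1": {"n_complaints": 2, "tenure_months": 6}}


def predict_churn(features):
    """Hypothetical logistic model over two features."""
    z = -1.0 + 0.8 * features["n_complaints"] - 0.05 * features["tenure_months"]
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(body):
    """Handle a JSON request body and return (status_code, JSON response)."""
    request = json.loads(body)
    if "features" in request:
        # Second approach: the caller supplies the full payload.
        features = request["features"]
    elif request.get("customer_id") in PROFILE_STORE:
        # First approach: the service pulls the data for the identifier.
        features = PROFILE_STORE[request["customer_id"]]
    else:
        return 404, json.dumps({"error": "unknown customer"})
    return 200, json.dumps({"churn_score": predict_churn(features)})
```

Calling `handle_request('{"customer_id": "c1"}')` and calling it with the same features in a `features` payload go through the same model, so the only difference between the two approaches is who is responsible for assembling the inputs.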

Sequence Flow

The flow for handling the prediction using a mobile app with local storage can be described in 4 phases.

Application Initialization (1 to 3): The application initializes, makes a request to the customer profile, retrieves its initial value and initializes the profile in local storage.

Applications (4): The application stores the different events happening within the application in an array in local storage.

Prediction Preparation (5 to 8): The application wants to retrieve a new churn prediction, and therefore needs to prepare the information it has to provide to the churn web service.

For that, it makes an initial request to local storage to retrieve the values of the profile and the array of events it has stored.

Once they are retrieved, it makes a request to a reducer function, providing these values as arguments. The reducer function outputs an updated profile with the local events incorporated back into it.

Web-service Prediction (9 to 10): The application makes a request to the churn prediction web service, providing the updated/reduced customer profile from step 8 as part of the payload.

The web service can then use the information provided in the payload to generate the prediction and return its value to the application.
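A reducer function of the kind used in steps 5 to 8 could look roughly like the sketch below. The event types, the profile fields and the shape of local storage are assumptions chosen for illustration; an actual mobile app would implement the same idea in its own language.

```python
def reduce_profile(profile, events):
    """Fold locally stored events into a copy of the last known profile."""
    updated = dict(profile)  # never mutate the stored profile
    for event in events:
        if event["type"] == "page_view":
            updated["page_views"] = updated.get("page_views", 0) + 1
        elif event["type"] == "complaint":
            updated["n_complaints"] = updated.get("n_complaints", 0) + 1
        # Unknown event types are ignored in this sketch.
    return updated


# Local storage as initialized in steps 1 to 3, plus events from step 4.
local_storage = {
    "profile": {"customer_id": "c1", "page_views": 10, "n_complaints": 0},
    "events": [
        {"type": "page_view"},
        {"type": "complaint"},
        {"type": "page_view"},
    ],
}

# Steps 5 to 8: read local storage and compute the approximate profile,
# which then becomes the payload sent to the prediction web service.
payload = reduce_profile(local_storage["profile"], local_storage["events"])
```

Because the reducer returns a copy, the profile in local storage stays equal to the last value received from the backend until the next synchronization.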

Technologies

There are quite a few technologies that can be used to power a prediction web service:

Functions

AWS Lambda functions, Google Cloud Functions and Microsoft Azure Functions (although Python support is currently in beta) offer an easy-to-set-up interface for deploying scalable web services.

For instance, on Azure, a prediction web service could be implemented through an HTTP-triggered function.

Container

An alternative to functions is to deploy a Flask or Django application through a Docker container (Amazon ECS, Azure Container Instances or Google Kubernetes Engine).

Azure, for instance, provides an easy way to set up prediction containers through its Azure Machine Learning service.

Notebooks

Different notebook providers, such as Databricks and Dataiku, have notably worked on simplifying model deployment from their environments.

These offer the ability to set up a web service in a local environment or to deploy to external systems such as Azure ML Service, Kubernetes Engine, etc.

In-app

In certain situations, when legal or privacy requirements do not allow data to be stored outside of an application, or when there are constraints such as having to upload a large number of files, leveraging a model within the application tends to be the right approach.

Android ML Kit or the likes of Caffe2 allow models to be leveraged within native applications, while TensorFlow.js and ONNX.js allow models to be run directly in the browser or in apps leveraging JavaScript.

Considerations

Besides the method of deployment of the models, there are quite a few important considerations to keep in mind when deploying to production.

Model Complexity

The complexity of the model itself is the first consideration.

Models such as linear regression and logistic regression are fairly easy to apply and do not usually take much space to store.

Using a more complex model, such as a neural network or a complex ensemble of decision trees, will take more time to compute, more time to load into memory on cold start, and will prove more expensive to run.

Data Sources

It is important to consider the differences that could occur between the data source in production and the one used for training.

While it is important for the data used for the training to be in sync with the context it would be used for in production, it is often impractical to recalculate every value so that it becomes perfectly in-sync.

Experimentation framework

It is also worth setting up an experimentation framework to A/B test the performance of different models against objective metrics, and ensuring that there is sufficient tracking to accurately debug and evaluate model performance a posteriori.
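One possible building block for such a framework is a deterministic assignment of customers to model variants, combined with a per-variant summary of an objective metric. The sketch below is one way to do it, not a prescribed design: the hashing scheme, the variant names and the function names are assumptions.

```python
import hashlib


def assign_variant(customer_id, experiment, variants=("control", "challenger")):
    """Deterministically map a customer to one model variant."""
    key = f"{experiment}:{customer_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


def summarize(outcomes):
    """Average an objective metric per variant.

    `outcomes` is an iterable of (variant, metric_value) pairs.
    """
    totals = {}
    for variant, value in outcomes:
        count, total = totals.get(variant, (0, 0.0))
        totals[variant] = (count + 1, total + value)
    return {variant: total / count for variant, (count, total) in totals.items()}
```

Hashing the experiment name together with the customer id keeps a customer on the same variant for the whole experiment, while letting different experiments split the population independently.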

Wrapping Up

Choosing how to deploy predictive models into production is quite a complex affair. There are different ways to handle the lifecycle management of the predictive models, different formats to store them, multiple ways to deploy them and a very vast technical landscape to pick from.

Understanding the specific use cases, the team’s technical and analytics maturity, and the overall organization structure and its interactions helps in coming to the right approach for deploying predictive models to production.
