Rethinking the Data Science Life Cycle for the Enterprise

How much faster could models be promoted into production if data scientists could easily find the lifecycle outputs required by an enterprise gating process? Most data scientists would overwhelmingly agree that reproducibility, traceability, and verifiability improve their effectiveness and agility.

Reproducibility, Traceability, and Verifiability Drive Data Science Lifecycle Maturity

Consider a few common scenarios that data scientists experience regularly, and the shortcomings in current data science lifecycles:

A data scientist must address production data drift that is degrading model performance: An enterprise with a mature lifecycle can quickly re-establish the model baseline by acquiring the model source code and training files and then reproducing results, after which the model can be updated and re-trained.

Unfortunately, in many enterprises even this simple scenario is difficult to achieve, since the linkages between model source code and training data are not maintained.

A data scientist must prove to an auditor or regulator that a model achieves a correct result based upon original business events: In this case, the data scientist in a mature enterprise would be able to demonstrate how raw data files are correctly enriched, transformed, and/or aggregated such that the model can be verifiably trained based upon a traceable lineage of data.

However, most enterprises are unable to link training data lineage with the associated transformations that must come together to form a verifiable model.

A data scientist must demonstrate that all enterprise gates and checks have been successfully completed prior to promoting a model: To facilitate this outcome in a mature enterprise, DEVOPS and MLOPS processes would automatically capture the critical artifacts and AI/ML lifecycle outputs required by an enterprise gating process.

Unfortunately, few enterprises automatically (or even manually, in some cases) capture and retain the critical outputs from the AI/ML lifecycle needed to allow easy and quick promotion of a model.

Clearly there are significant opportunities to improve an enterprise’s data science lifecycle maturity by focusing on reproducibility, traceability, and verifiability.

The real question is not whether an enterprise should address these issues, but rather how quickly it must address them.

The New Data Science Lifecycle Enabler: the Lifecycle Catalog

Reproducibility, traceability, and verifiability are enabled by several simple capabilities: capturing artifacts about models as they evolve over the data science lifecycle, storing these artifacts, and searching and viewing them.

These capabilities require only slight additions to the current data science lifecycle, together with a tool (a “Lifecycle Catalog”) that provides a viewport into the data science lifecycle.

In simple terms, the Lifecycle Catalog is a portal into a repository that contains references to model source code, model training files, raw source data and the programs that transform the data into training files, and other artifacts that are captured along the data science lifecycle:

To address reproducibility, the Lifecycle Catalog maintains references to a model’s source code (both current and previous releases) and to the data that was used to train it.

To address traceability, the Lifecycle Catalog maintains references to original source system data and the data engineering scripts that were used to transform and enrich it, thereby providing visibility into all changes to data across the delivery lifecycle.

To address verifiability, references to training outputs, logs, and related artifacts — including output logs related to a model’s bias and “ethical” checks — are managed by the Lifecycle Catalog thereby capturing evidence of a model’s efficacy.
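One way to picture a catalog record is as a structured entry that groups these three sets of references for a single model version. Below is a minimal sketch in Python; the class name, fields, and reference URIs are hypothetical illustrations, since the article does not prescribe a schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Hypothetical Lifecycle Catalog record for one model version."""
    model_name: str
    version: str
    # Reproducibility: references to model source code and training data
    source_code_ref: str                       # e.g. a git commit URI
    training_data_refs: list = field(default_factory=list)
    # Traceability: raw source data and the transforms applied to it
    raw_data_refs: list = field(default_factory=list)
    transform_script_refs: list = field(default_factory=list)
    # Verifiability: training outputs, logs, bias/ethics check results
    evidence_refs: list = field(default_factory=list)

# Example entry (all names and URIs are illustrative):
entry = CatalogEntry(
    model_name="churn-predictor",
    version="2.1.0",
    source_code_ref="git://models/churn@abc123",
    training_data_refs=["s3://lake/training/churn-2024-06.parquet"],
    raw_data_refs=["s3://lake/raw/crm-events/"],
    transform_script_refs=["git://pipelines/churn-etl@def456"],
    evidence_refs=["s3://evidence/churn/2.1.0/bias-report.json"],
)
```

Note that the catalog stores references (pointers to where assets live), not the assets themselves, which keeps it lightweight while still linking every lifecycle output back to the model version that produced it.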

To automate the information capture process, the Lifecycle Catalog would integrate with an AI/OPS (DEVOPS for AI/ML) process to automatically capture the aforementioned artifacts.

Interestingly, tools and capabilities are becoming available from major cloud providers, traditional DEVOPS vendors, as well as newer AI/OPS start-ups which can be stitched together to capture many of the required metrics and meta-data.
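The automatic-capture idea can be sketched as a small hook that each pipeline stage calls as it produces outputs. This is a minimal illustration, not a real AI/OPS API; the function name, record fields, and in-memory store are assumptions standing in for a catalog service:

```python
import time

# In-memory stand-in for the catalog's artifact store; a real
# integration would call the catalog service's API from each stage.
CATALOG = []

def capture_artifact(stage, kind, ref, metadata=None):
    """Record a lifecycle artifact as a pipeline stage produces it."""
    record = {
        "stage": stage,            # e.g. "train", "evaluate", "bias-check"
        "kind": kind,              # e.g. "metrics", "log", "model-binary"
        "ref": ref,                # pointer to where the artifact lives
        "metadata": metadata or {},
        "captured_at": time.time(),
    }
    CATALOG.append(record)
    return record

# Pipeline stages capture outputs as a side effect of running
# (stage names and URIs are illustrative):
capture_artifact("train", "metrics",
                 "s3://evidence/churn/2.1.0/metrics.json",
                 metadata={"accuracy": 0.91})
capture_artifact("bias-check", "log",
                 "s3://evidence/churn/2.1.0/bias.log")
```

Because capture happens inside the pipeline rather than as a manual afterthought, the evidence needed for gating accumulates automatically with every run.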

The Lifecycle Catalog’s portal allows an enterprise’s data scientists to search, visualize, and track models, related data, and artifacts across the AI/ML lifecycle:

To drive agility, the Lifecycle Catalog allows data scientists to: (a) view and manage their model inventory, (b) view the status (deployed, under development, etc.) of AI/ML models and versions, (c) view and manage references to training assets and related data lineage assets used to create each AI/ML model, and (d) view and manage references to artifacts generated throughout the AI/ML lifecycle.
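The search capability above amounts to filtering catalog records by attributes such as model name or status. A minimal sketch, assuming records are simple dictionaries (a real portal would back this with a database or search index):

```python
def search_catalog(records, **filters):
    """Return records whose fields match all supplied filters."""
    return [r for r in records
            if all(r.get(k) == v for k, v in filters.items())]

# Illustrative inventory (model names and versions are hypothetical):
records = [
    {"model": "churn-predictor", "version": "2.1.0", "status": "deployed"},
    {"model": "churn-predictor", "version": "2.2.0", "status": "under development"},
    {"model": "fraud-scorer", "version": "1.0.3", "status": "deployed"},
]

# Find every deployed model across the enterprise inventory:
deployed = search_catalog(records, status="deployed")
```

The same filtering pattern supports the other portal views: filter by model to see its version history, or by lifecycle stage to see which artifacts a given model still lacks.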

To drive effectiveness, the Lifecycle Catalog provides access to all governed AI/ML artifacts.

Consider a scenario where this is critical: many organizations now must prove that models are unbiased and provide results that are in keeping with the corporation’s ethical guidelines — in this case, bias testing outputs are maintained to offer this proof.

To address security, only authenticated and authorized individuals can view and/or manage AI/ML models, deployments, or training assets in the Lifecycle Catalog.

To drive efficiency, the Lifecycle Catalog offers all of the evidence required by an enterprise gating process, thereby offering the potential for “self-serve” AI/ML governance.

The Lifecycle Catalog is the “book of record” for the data science lifecycle, allowing data scientists to visualize, categorize, group, manage, and govern all of an enterprise’s data science assets.

The Lifecycle Catalog is the missing link in the data science lifecycle that enables an enterprise’s fundamental need for reproducibility, traceability, and verifiability.

Other Articles By Eric Broda

AI/Machine Learning in the Enterprise Just Became Much More Challenging: Here is what you need to know to address data privacy and data management concerns

The Most Difficult Part about AI/Machine Learning Occurs after the Model is Created: Enterprise DEVOPS must evolve to accommodate AI/Machine Learning (AI/ML).

Traditional IT Governance Must Be Reengineered For Enterprise AI/ML: What needs to be in place to drive AI/ML agility while ensuring necessary reproducibility and traceability?

We Should Be Breaking Up the Facebook Monopoly: The current thinking by politicians and insiders to breaking up Facebook will not address monopoly concerns, nor lead to a more competitive market.

There is a better way.

