What does it mean to “productionize” data science?

Build a staircase.

But the room isn’t very big, which means the staircase is really steep and narrow.

No problem: build a wall next to the stairs so people can’t fall off.

But a bedroom needs a closet, and there wasn’t space for both a closet and the bed.

Again, not a problem — just cover half of the window with the closet.

The result is, in the most lenient definition of the term, a bedroom.

A lot of data science productionization — I would go so far as to say a majority of it — is the code equivalent of Fonthill Castle.

Those systems use good technology, follow many best practices at the code level, and are the result of the hard work of many competent people…and at best they are awkward.

At worst, they are a constant source of pain and expense to the business.

That bedroom in Fonthill Castle technically has all of the components of a bedroom (an entrance, a bed, a closet, and so forth), but people are generally happier in bedrooms that aren’t difficult to live in, and businesses are generally more efficient and profitable when they are structured so that the processes for leveraging data don’t limit the value the data has to offer.

The better the system as a whole is designed, the more value you get from any one component.

Here’s an illustration.

It took me six months to create this one diagram.

I’ve replaced all the labels with generic terms.

Of course, the diagram itself took only a few hours.

Everything the diagram represents took six months.

It’s a complete redesign of a system for modeling home locations for IP addresses for the purpose of feeding those locations into a real-time bidding process.

This system ensures that if a client wants to target consumers in Milwaukee, they don’t pay for consumers in Los Angeles.

It’s a core business function, backing one of my employer’s core value propositions.

The little boxes and lines in the diagram represent the results of many hours of conversations with stakeholders, coordination with multiple engineering teams, productionization of two other processes whose outputs were needed to support this additional process, and the development of five separate algorithms.

All of those accomplishments stood on their own without me mapping out how they all fit together.

Using this diagram, however, our engineering team was able to write the code for deploying this entire process very quickly.

More important, when new requirements emerged after the end of that initial implementation, our engineers were able to incorporate them into the larger system relatively easily because they had a map of all the places the new requirements could or needed to be inserted.

The diagram above doesn’t actually represent the process as it now exists in our company, because putting that diagram into production helped us realize a whole lot of other constraints we needed to address.

But having the foundational design made those constraints more easily addressable, and helped us avoid a lot of clumsy workarounds.

Productionization involves up-front investment in systems that smooth the deployment, maintenance, and adoption of whatever data processes we choose to employ.

The design work necessary for productionization almost always lengthens the time it takes to launch a product, and because of that it is often neglected.

But a delayed launch is less frustrating and expensive than a blundered launch.

It pays to take the time to design the system well.

Airplanes vs. taxis (or lines vs. circles)

That raises the question of what it means for a data science capability to be designed well.

A common mental model for analytic products is what I’ll call the airline analogy.

Airline flights are discrete events: you charter a flight by specifying certain places you want to go and certain things you want to have on your way there, then you have to actually fly on the plane, and then after disembarking you can look back to decide how the flight went.

The airline model is, in my opinion, a less-desirable design choice.

It makes it too easy to devote too much attention to automating the in-flight portion of data science, with the result that businesses end up doing too much of the pre-flight and post-flight manually.

Pre-flight work, for example, is largely a matter of asking clients what they want to accomplish and how to prioritize various considerations.

Data scientists build data products that transform those manual decisions into stuff an automated tool like a machine learning algorithm can understand.

Likewise, after the algorithm has done its thing, the output is often a report or presentation or dashboard that gets sent to stakeholders, ostensibly to inform business decisions and possibly to motivate future analytic work.

When we view a data science project as a flight, it’s easy to think of things in terms of [humans] -> [computers] -> [humans], and thereby miss a lot of automation opportunities.

In my opinion, as long as we view data work as a series of flights we’re leaving money on the table.

Aside from the human error that gets introduced when we rely on a lot of manual decisions, a focus on automation of in-flight work limits our ability to optimize.

If, for example, we aren’t very good at targeting a desired outcome (say, customer retention or sales), that may be because our prediction algorithms aren’t working as well as they could, but it could also mean that our analysis was scoped wrong or that we aren’t taking advantage of data that the analysis itself generates.

A fully-optimized data capability is one that continuously monitors performance and dynamically iterates forward from its initial setup criteria.

In other words, we need our data work to behave less like an airplane and more like a taxi.

In both cases, we have to decide on a destination, but we should be able to modify that destination mid-trip, or add unexpected stop-offs, or pick up new passengers on the way, or take an entirely different route than the one originally planned because we get new information.

Subsequent trips should automatically start at the previous trip’s destination.

Fully productionized data science is a circle, not a line.

If we do that, then instead of specifying, then conducting, and then evaluating the analysis, we have just two tasks, operation and calibration, and each task feeds into the other.

The point of measurement shouldn’t be to spin out a report, but rather to adjust the initial parameters of the analysis while that analysis is still in progress (the report can be produced as a side-effect of the calibration effort).

The initial parameters should be treated as priors to be modified rather than as requirements to be met.

This is the intuition behind multi-armed bandit algorithms.
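As a rough illustration of treating setup criteria as priors, here is a minimal sketch of Thompson sampling, one common multi-armed bandit strategy. It is not the system described in this post: the `BetaBandit` class, the `simulate` helper, and the arm names are hypothetical, and it assumes each option either succeeds or fails on every try.

```python
import random


class BetaBandit:
    """Thompson sampling over success/failure arms with Beta priors."""

    def __init__(self, priors):
        # priors: {arm_name: (alpha, beta)}. These stand in for the initial
        # setup criteria: beliefs to be revised, not requirements to be met.
        self.params = {arm: list(ab) for arm, ab in priors.items()}

    def choose(self, rng=random):
        # Operation: draw a plausible success rate for each arm and play
        # the arm with the highest draw.
        draws = {arm: rng.betavariate(a, b) for arm, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, arm, success):
        # Calibration: every observed outcome revises that arm's prior.
        if success:
            self.params[arm][0] += 1
        else:
            self.params[arm][1] += 1

    def expected_rate(self, arm):
        a, b = self.params[arm]
        return a / (a + b)


def simulate(true_rates, priors, rounds=2000, seed=0):
    """Run the operate/calibrate loop against made-up true success rates."""
    rng = random.Random(seed)
    bandit = BetaBandit(priors)
    pulls = {arm: 0 for arm in priors}
    for _ in range(rounds):
        arm = bandit.choose(rng)
        success = rng.random() < true_rates[arm]
        bandit.update(arm, success)
        pulls[arm] += 1
    return bandit, pulls
```

Calling something like `simulate({"A": 0.05, "B": 0.2}, {"A": (8, 2), "B": (1, 1)})` starts with a prior that wrongly favors arm A. Because every outcome feeds back into the priors, the loop should shift most of its attention to arm B without anyone rewriting the original specification.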

Data science productionization moves humans out of the loop, but that doesn’t mean humans don’t play a crucial role.

If we focus on automating every part of our data cycle, we actually create more opportunities for humans to meaningfully interact with the data and reduce the chances that our automation will result in unanticipated and undesirable consequences.

When humans can spend less time enabling the automation, they can spend more time monitoring and critiquing it.

Under the airplane model of data science, we have to look for distinct opportunities to connect the data systems with the business.

The goals, requirements, constraints, and other inputs to a data system are comparable to the map of destinations, schedule of flights and availability of seats, terminals and gates at the airports, and other information that allows someone to book a flight, while the data system itself is the actual fleet of airplanes.

If we make it possible for our systems to repeatedly self-assess and reconfigure themselves while still in progress, then we don’t need to figure out connections: the data system is the business.

To reiterate: it’s perfectly normal and reasonable to think about data science work as [business needs] -> [algorithms] -> [results].

It is preferable to think about it as [business needs] -> [algorithms] -> [revised needs] -> [revised algorithms] -> [revised needs], and so on.

The major design challenge is not to choose the most appropriate algorithm or construct the most efficient implementation of that algorithm.

It’s to ensure automated, continuous revision of the analytic solution.

It’s to ensure the active discovery of opportunities that the original solution missed.
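To make the circle concrete, here is a minimal sketch of an operate/calibrate loop in which each cycle’s output becomes the next cycle’s setup. Everything in it is an assumption made for illustration: the `Params` fields, the score threshold, and the fixed adjustment step are hypothetical stand-ins, not a description of any real system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Params:
    threshold: float    # hypothetical tunable: minimum score worth acting on
    target_rate: float  # hypothetical business need: desired success rate


def operate(params, scored):
    """Operation: act on every (score, outcome) pair that clears the threshold."""
    return [outcome for score, outcome in scored if score >= params.threshold]


def calibrate(params, outcomes, step=0.05):
    """Calibration: revise the parameters; the report is only a by-product."""
    rate = sum(outcomes) / len(outcomes) if outcomes else None
    if rate is None or rate > params.target_rate:
        threshold = max(0.0, params.threshold - step)  # room to widen reach
    elif rate < params.target_rate:
        threshold = min(1.0, params.threshold + step)  # be more selective
    else:
        threshold = params.threshold
    report = {"acted_on": len(outcomes), "success_rate": rate, "threshold": threshold}
    return replace(params, threshold=threshold), report


def run_cycles(params, batches):
    """Feed each cycle's revised parameters into the next cycle."""
    reports = []
    for scored in batches:
        outcomes = operate(params, scored)
        params, report = calibrate(params, outcomes)
        reports.append(report)
    return params, reports
```

The design choice worth noticing is that `calibrate` returns new parameters first and a report second. A human can still read the reports, but the loop does not wait for anyone to read them before it adjusts itself.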

Systems are more important than tools

Not everyone who uses data science tools will find this kind of automated continuous improvement desirable or even possible.

That’s ok.

Not all data science needs to be productionized.

But full productionization improves model performance, decreases maintenance costs, shrinks risks (both financial and ethical), and, perhaps most important, minimizes the chance that good analytic results get ignored or wasted.

Decisions to invest in data science are often articulated in terms of goals — there are things you want to accomplish as a business, and data science is seen as a way to achieve those results.

But if we just drop really awesome technology into an inefficient or counterproductive system, the technology will only allow us to look more sexy as we fail.

The major determinant of success is the design of the systems we use to make data a part of our business.

That’s productionization.

