An engineer’s perspective on engineering and data science collaboration for data products

As engineers, we may look down on one-offs or distasteful hacks, but the YAGNI principle suggests doing the simplest thing that could possibly work.

In product engineering, this is done by building a minimum viable product.

But continuously productionizing models risks iterating toward a local optimum.

Iteration by definition is incremental, while building a platform enables new capabilities.

We believe productionizing models is a necessary prerequisite to building a platform.

Building a platform is like building an abstraction, as it grants the users capabilities without them needing to understand the details.

Deciding what to put in the abstraction is the challenge.

So much so that duplication is often preferable to the wrong abstraction.

As a result, our initial investments in a new data product take the form of a minimum viable product.

This product often requires manual work to iterate on, but is crucial to discovering patterns and high-value iteration paths.

Those paths are then systematized into platforms.

The platform provides a foundation for future iteration and new feature development by the data scientists.

For the engineer, the platform must be maintained going forward, but it automates away part of the previously manual workflow and increases the velocity of iteration.

In our experience, the engineering maintenance cost is worth it.

[Figure: Personalized in-course coaching message within the learning experience]

An example of how we struck a balance between productionizing models and building a platform is our automated coaching feature.

In the present iteration, we have the capability to target messages to any learner segment and to personalize messages to each individual learner.

We use a feedback loop to control the volume and relevance of messages.

Our automated coach can also collect learner goals that we surface back to learners at later times.

But this data product feature didn’t start out this way.

We first productionized an automated coaching feature capable only of sending generic messages to all learners doing a certain action during a certain course.

We were able to track learners’ receptiveness to the various message types, and tested out various learning nudges, such as emphasizing a growth mindset.

After iteration on the message types and copy, we saw big differences between messages in engagement and helpfulness rates.
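A comparison like this can be sketched in a few lines. The example below is a minimal illustration, not our production implementation: it uses Python's built-in `sqlite3` module as a stand-in for a warehouse, and the `coaching_feedback` table, its columns, the sample rows, and the 0.5 threshold are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: one row per coaching message shown to a learner.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE coaching_feedback (
        message_type TEXT NOT NULL,    -- e.g. 'growth_mindset'
        learner_id   INTEGER NOT NULL,
        was_helpful  INTEGER NOT NULL  -- 1 if marked helpful, else 0
    )
    """
)
# Made-up sample data, for illustration only.
conn.executemany(
    "INSERT INTO coaching_feedback VALUES (?, ?, ?)",
    [
        ("growth_mindset", 1, 1),
        ("growth_mindset", 2, 1),
        ("growth_mindset", 3, 0),
        ("generic_reminder", 1, 0),
        ("generic_reminder", 2, 0),
        ("generic_reminder", 3, 1),
    ],
)


def helpfulness_rates(conn):
    """Return {message_type: share of impressions marked helpful}."""
    rows = conn.execute(
        """
        SELECT message_type, AVG(was_helpful) AS helpful_rate
        FROM coaching_feedback
        GROUP BY message_type
        """
    ).fetchall()
    return dict(rows)


def message_types_to_keep(rates, min_rate=0.5):
    """Keep message types whose helpfulness rate meets an assumed threshold."""
    return sorted(t for t, rate in rates.items() if rate >= min_rate)


rates = helpfulness_rates(conn)
kept = message_types_to_keep(rates)
print(rates)
print(kept)
```

Feeding the surviving message types back into what gets sent is one simple way to close a feedback loop between learner responses and message volume.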

We also hypothesized additional use cases for these in-course coaching messages such as allowing learners to self-assess their competency on a module.

At this point, building out a platform to enable high-value iterations became a necessary and natural next step.

Striking a balance between productionizing models and building platforms results in:

- Speedy fulfillment of specific use cases
- New capabilities emerging as the right forward-looking features are systematized into a platform
- An ability to steadily and iteratively improve data product features and impact

Use data and SQL as the universal language

Data scientists at Coursera operate in R and Python, while engineers write Scala.

There are a few viable approaches to bridging this difference when developing data products.

One way is to train data scientists in Scala and engineers in R and Python.

This approach is common in smaller organizations where individuals wear many hats.

The pros of this approach are that coordination costs are minimal and flexibility is maximized.

Engineers and data scientists jointly define and redefine the collaboration model on an as-needed basis for each data product.

But engineers and data scientists with cross-trained skills are hard to find.

This strategy also punishes high-performing engineers and data scientists who prefer to focus on their domain of expertise.

A second way is to have data scientists own the model prototyping phase and engineers own the model productionizing phase.

This approach is common in larger organizations that can afford to hire for specialized roles such as machine learning engineers.

The advantage of this approach is that specialization can bring efficiency.

Domain expertise and industry best practices have emerged around the ML engineering field.

However, ownership questions arise as machine learning engineers need to interface with both front-end product engineers and data scientists to productionize a data product.

Striking the right headcount balance among data scientists, machine learning engineers, and product engineers is another challenge.

A third way is to use data and SQL as the intermediary.

In this approach, data is the lingua franca among data scientists and engineers.

We’ve had good success with this approach in the past few years.

A benefit of this approach is that SQL plus data is a constrained interface that requires minimal training to operate.

Data is dumb.

It is easy to inspect, visualize, and debug data using SQL, and it is easy to collaborate without hidden states, assumptions, and nuances.

Furthermore, this approach tightens the iteration loop, as data scientists can iterate on a model from end to end.

We think this approach works for the majority of cases.

But we recognize there are scenarios where data is not an ideal interface.

The two main scenarios are when we need to encode stateful operations in data, and when precomputation of results is onerous.

In practice, we’ve found these scenarios to be infrequent and not first-order concerns.

To use data and SQL as the universal language, we’ve had to build out and democratize our data warehouse, solve the problem of who writes ETLs (answer: everyone), and provide interfaces, libraries, and tools to make the data and SQL ubiquitous across the data science and engineering organizations.

[Figure: “Based on your recent activity” recommendations module]

An example of engineering and data science collaborating at the data boundary is our recommendations module infrastructure.

It is a system that produces recommendations at various degrees of personalization.

Recommendation modules range from fully personalized to the user (e.g., “Based on your recent activity”) to generic cold start recommendations to everything in between (e.g., “Because you viewed Machine Learning”).

Algorithms generating the recommendations range from matrix factorization to regression to rule-based queries.

Despite this variety, data is an effective encapsulation: a combination of results, scores, and metadata serves as an effective internal API.

It meets the characteristics of a good API: It’s easy for engineers to consume, easy for data scientists to produce, and sufficiently powerful for our use cases.
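To make the idea of data as an internal API concrete, here is a minimal sketch. It again uses Python's built-in `sqlite3` module as a stand-in for the warehouse. The `recommendations` table, its columns, the `recent_activity` module key, and the `publish` and `top_items` helpers are hypothetical names chosen for illustration, not our actual schema or code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE recommendations (
        module   TEXT NOT NULL,  -- which recommendations module the row feeds
        user_id  INTEGER,        -- NULL marks a generic (cold start) row
        item_id  TEXT NOT NULL,  -- the result
        score    REAL NOT NULL,  -- the ranking score
        metadata TEXT NOT NULL   -- JSON blob, e.g. which algorithm produced it
    )
    """
)


def publish(conn, module, rows):
    """Producer side (data science): write results, scores, and metadata."""
    conn.executemany(
        "INSERT INTO recommendations VALUES (?, ?, ?, ?, ?)",
        [(module, user, item, score, json.dumps(meta))
         for user, item, score, meta in rows],
    )


def top_items(conn, module, user_id, limit=2):
    """Consumer side (engineering): personalized rows first, then generic ones."""
    rows = conn.execute(
        """
        SELECT item_id, score, metadata
        FROM recommendations
        WHERE module = ? AND (user_id = ? OR user_id IS NULL)
        ORDER BY (user_id IS NULL), score DESC
        LIMIT ?
        """,
        (module, user_id, limit),
    ).fetchall()
    return [(item, score, json.loads(meta)) for item, score, meta in rows]


# Made-up rows; the consumer never needs to know which algorithm wrote them.
publish(conn, "recent_activity", [
    (42, "course-ml", 0.9, {"algorithm": "matrix_factorization"}),
    (42, "course-stats", 0.7, {"algorithm": "regression"}),
    (None, "course-python", 0.5, {"algorithm": "rule_based"}),
])

known_user = top_items(conn, "recent_activity", user_id=42)
new_user = top_items(conn, "recent_activity", user_id=7)
print(known_user)
print(new_user)
```

In this sketch the producer and consumer share only the table shape. Swapping one algorithm for another changes the rows that are written, not the query that reads them.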

Using data and SQL as the universal language results in:

- Clear boundaries of focus between engineers as data consumers and data scientists as data producers
- An understandable and debuggable interface
- A common language between data scientists and engineers when collaborating on shared concerns

At Coursera, engineers and data scientists have built many data products.

We’ve learned that building a data product is a team sport.

As with any team, our goal is to be more than the sum of our parts through effective collaboration.

This post has outlined three themes that worked well in our pursuit of this goal from an engineering perspective.

Be on the lookout for a post from a data scientist’s perspective!

Special thanks go to Emily Glassberg Sands, Vinod Bakthavachalam, Rachel Reddick, Phil Cayting, and Emily Keller-Logan for providing feedback on the drafts.
