4 Key Aspects of a Data Science Project Every Data Scientist and Leader Should Know

Synergies with the User Interface Module and Overall User Experience

In all practical applications, the data science component is only an enabling technology and never the complete solution by itself.

Users interact with the end application through a User Interface (UI).

The User Experience (UX) should be designed in a way that is synergistic with the powers of the underlying data science component while camouflaging its shortcomings.

Let me illustrate an optimal way of synergizing UI/UX with the data science component using two different examples: a search engine and a word processor.

Search Engine

A typical web search engine uses heavy data science machinery to rank and categorize web pages.

It then returns the most relevant ones in response to the user’s query.

If the data science component can interpret the query with high confidence and extract the exact answer, the user interface can use this confidence to display just the answer as a ‘zero-click result’.

This will lead to a seamless UX.

Source: searchenginejournal.com

Google applies this.

For queries like ‘prime minister of India,’ it returns the answers as ‘knowledge panels’.

On the other hand, when the data science component’s confidence in the exact answer is below a certain threshold, it is safer to let the user get to the specific answer through a few more clicks than to risk a bad user experience.

When we search for ‘where do MPs of India meet’, Google’s first link has the right answer, but because the confidence in the exact answer snippet is lower, it doesn’t show a ‘knowledge panel’.
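The display decision described above can be sketched in a few lines. This is only an illustration under my own assumptions: the `Answer` type, the `render_results` function and the 0.9 cut-off are hypothetical names and values, not how any real search engine is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    text: str
    confidence: float  # model's confidence in the extracted answer, in [0, 1]


# Hypothetical cut-off; in practice it would be tuned on evaluation data.
DIRECT_ANSWER_THRESHOLD = 0.9


def render_results(answer: Optional[Answer], ranked_links: List[str]) -> dict:
    """Decide what the UI shows for a query."""
    if answer is not None and answer.confidence >= DIRECT_ANSWER_THRESHOLD:
        # High confidence: show the answer directly, with the links as backup.
        return {"mode": "direct_answer", "answer": answer.text, "links": ranked_links}
    # Low confidence: hide the uncertain snippet and let the user click through.
    return {"mode": "links_only", "answer": None, "links": ranked_links}
```

The point of the sketch is that the UI never shows an uncertain snippet: below the threshold, the same ranked links are returned but the extracted answer is withheld.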

There is another way to exploit UI/UX synergy with the data science component.

The users’ interactions with the system can be used both to generate indirect ‘labeled data’ and as a proxy for evaluating the system’s performance.

For example, imagine a scenario where a web search engine returns the top 10 results for a given query and the user almost always clicks on either the second or the third link.

This implies that the underlying data science component needs to revisit its ranking algorithm so that the second and the third links are ranked higher than the first one.

This ‘wisdom of the crowd’ also provides labeled pairs of query and relevant web page.

Admittedly, labeled pairs inferred in such a way will include a variety of user biases.

Hence, a nontrivial label normalization process is needed before these labels can be used for training the data science component.
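One simple way to picture such a normalization is to correct for position bias: links shown higher get clicked more regardless of relevance. The sketch below is my own simplified assumption, not a production method. It compares the clicks a (query, URL) pair actually received with the clicks expected from the average click rate of the positions where it was shown. The function name and the log format are hypothetical.

```python
from collections import defaultdict


def position_normalized_scores(log):
    """log: list of (query, url, position, clicked) tuples.

    Returns {(query, url): score}. A score above 1 means the page was
    clicked more often than is typical for the positions it was shown at.
    """
    # Pass 1: average click rate per position, across all queries.
    pos_views = defaultdict(int)
    pos_clicks = defaultdict(int)
    for _, _, pos, clicked in log:
        pos_views[pos] += 1
        pos_clicks[pos] += int(clicked)
    baseline = {p: pos_clicks[p] / pos_views[p] for p in pos_views}

    # Pass 2: observed clicks vs. clicks expected from position alone.
    observed = defaultdict(int)
    expected = defaultdict(float)
    for query, url, pos, clicked in log:
        observed[(query, url)] += int(clicked)
        expected[(query, url)] += baseline[pos]
    return {k: observed[k] / expected[k] for k in observed if expected[k] > 0}
```

Pairs with a high score are candidates for positive ‘query-and-relevant-Webpage’ labels. Other user biases (for example, attractive titles) would still need separate treatment.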

Word Processor

Similarly, consider a typical spell-checker in a word processor.

The underlying data science machinery is tasked with recognizing when a typed word is likely a spelling mistake, and if so, highlighting the misspelled word and suggesting likely correct words.

When the data science machinery finds only a single likely correct spelling, and that too with high confidence, it should auto-correct the spelling to provide a seamless user experience.

On the other hand, if there are multiple likely correct words for the misspelled word, each with a reasonably high confidence score, the UI should show them all and let the user choose the right one.

Similarly, if the multiple likely correct words all have low confidence scores, the UI should camouflage this shortcoming by highlighting the spelling error without suggesting any corrections.

This again makes for a pleasant user experience.

Thus, the data science team must understand all the transformations that the data science-driven output will go through before reaching the end user’s hands.

And the UI designers and engineers should understand the nature of likely errors that the data science component will make.

The data science delivery leader has to drive this collaboration across the teams to provide an optimal end solution.

Also, notice that I mentioned “high confidence” and “low confidence” above, whereas what the machines will need is “confidence above 83%”.

This is the ‘qualitative to quantitative gap’ that we discussed in the previous article of this series.
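To make the gap concrete, here is a minimal sketch of the spell-checker rules with explicit numbers. The 0.83 cut-off reuses the illustrative “83%” above; the 0.40 cut-off, the function name and the candidate format are my own assumptions, and real thresholds would come from tuning on actual data.

```python
AUTO_CORRECT_THRESHOLD = 0.83  # illustrative "high confidence" cut-off
SUGGEST_THRESHOLD = 0.40       # hypothetical "reasonably high confidence" cut-off


def spellcheck_action(candidates):
    """candidates: list of (word, confidence) pairs for one flagged word.

    Returns (action, words_to_show).
    """
    # Keep only reasonably confident candidates, best first.
    ranked = sorted(candidates, key=lambda wc: wc[1], reverse=True)
    strong = [word for word, conf in ranked if conf >= SUGGEST_THRESHOLD]
    top_conf = ranked[0][1] if ranked else 0.0

    if len(strong) == 1 and top_conf >= AUTO_CORRECT_THRESHOLD:
        return ("auto_correct", strong)   # one clear winner: fix silently
    if strong:
        return ("suggest", strong)        # plausible options: let the user pick
    return ("highlight_only", [])         # nothing trustworthy: just flag the word
```

Writing the rules this way forces the teams to agree on the actual numbers rather than on words like “high” and “low”.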

The Trade-off between Compute Cost and System Accuracy

The next aspect on which the teams have to build a common understanding is the nature of the user’s interaction with the end-to-end system.

Let’s take the example of a speech-to-text system.

Here, if the expected setup is such that the user uploads a set of speech files and expects an auto-email-alert when the speech-to-text outputs are available, the data science system can take a considerable amount of time to generate the best quality output.

On the other hand, what happens if the user interaction is such that the user speaks a word/phrase and waits for the system to respond? The data science system architecture will then have to accept a higher compute cost in order to generate near-instantaneous results with high accuracy.

Knowing the full context in which the data science system will be deployed can also help to make informed trade-offs between the data science system’s compute efficiency and its overall accuracy.

In the above example of speech-to-text, suppose we know that the end-to-end system restricts the user to speaking only the names of people in his/her phone-book.

So here, the data science component can restrict its search space to the names in the phone-book rather than searching through millions of people’s names.
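As a rough illustration of restricting the search space, the sketch below matches a raw transcript only against the names in the phone-book, using Python’s standard `difflib` string matching as a stand-in for the real acoustic and language models. The function name, the sample names and the 0.6 cut-off are assumptions made for the example.

```python
import difflib


def match_contact(transcript, phone_book, cutoff=0.6):
    """Return the phone-book name closest to the raw transcript, or None."""
    # Compare case-insensitively, but return the name as stored.
    lowered = {name.lower(): name for name in phone_book}
    hits = difflib.get_close_matches(
        transcript.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


contacts = ["John Smith", "Jane Doe", "Priya Nair"]
best = match_contact("jon smith", contacts)
```

Because the candidate set is a few hundred names rather than millions, even a simple matcher can be both fast and accurate in this setting.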

The amount of computing power needed for training and executing the machine learning component typically grows linearly at lower accuracy numbers and then grows exponentially at higher accuracy numbers.

For the machine learning solution to be monetarily viable, the cost of running and maintaining it should be much lower than the revenue attributed to it.

This can be achieved in a couple of ways:

Holding discussions with the product team, the client team and the engineering teams to establish the sensitivity of the overall system to the accuracy of the data science solution. This can help establish what is a reasonable accuracy to aim for.

Reducing the number of times the most complex data science sub-component gets invoked. Once the most complex data science sub-component is identified, we need to identify the data samples on which it gets invoked repeatedly. Creating a lookup table of these common input-output pairs will increase the overall efficiency of the system. As an example, in enriching financial transactions in my current setup, such optimizations led to a drop of about 70% in the compute cost at the expense of only a few GB increase in the RAM for the lookup table.

A popular example where implementation and maintenance costs trump the gained accuracy is Netflix’s decision not to use the million-dollar-award-winning solution, which would otherwise have led to about a 10% increase in its movie-recommendation accuracy.

Model Interpretability

Yet another practical consideration that should be high on our list is ‘model interpretability’.

Being able to interpret why a given data science model behaved in a particular way helps prioritize changes in the model, changes in the training samples, and/or changes to the architecture to improve the overall performance.

In several applications like the loan-eligibility prediction we discussed above, or precision medicine, or forensics, the data science models are, by regulation, required to be interpretable so that human experts can check for unethical biases.

Interpretable models also go a long way in building stakeholder trust in the data-driven paradigm of solving business problems.

The flip side is that the most accurate models are often also the most abstract and hence the least interpretable.

Thus, one fundamental issue that the data science delivery leader has to address is the compromise between accuracy and interpretability.

Deep Learning-based models fall in the category of higher-abstraction-and-lower-interpretability models.

There is a tremendous amount of active research in making deep learning models interpretable (e.g., LIME and Layer-wise Relevance Propagation).
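To give a flavor of what such explanations look like, here is a much simpler perturbation-style sketch. It is not LIME or Layer-wise Relevance Propagation. It only replaces one input feature at a time with a baseline value and records how much the model’s score changes. The toy loan-scoring model, its weights and the feature names are hypothetical.

```python
def occlusion_attribution(predict, sample, baseline):
    """Score each feature by how much the prediction changes when that
    feature alone is replaced by its baseline value."""
    full_score = predict(sample)
    scores = {}
    for name in sample:
        perturbed = dict(sample)          # copy so the original is untouched
        perturbed[name] = baseline[name]  # "switch off" one feature
        scores[name] = full_score - predict(perturbed)
    return scores


def toy_loan_score(x):
    # Hypothetical stand-in for a black-box model.
    return 0.5 * x["income"] + 0.25 * x["credit_years"] - 1.0 * x["defaults"]


applicant = {"income": 4.0, "credit_years": 8.0, "defaults": 1.0}
reference = {"income": 0.0, "credit_years": 0.0, "defaults": 0.0}
contributions = occlusion_attribution(toy_loan_score, applicant, reference)
```

A positive contribution means the feature pushed the score up for this applicant and a negative one means it pulled the score down, which is the kind of per-decision explanation a human expert can review.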

End Notes

In summary, a high accuracy data science component by itself may not mean much even if it solves a pressing business need.

On one extreme, it could be that the data science solution achieves high accuracy at the cost of high compute power or high turnaround time, neither of which is acceptable to the business.

On the other extreme, it could be that the component that the end-user interacts with has minimal sensitivity to the errors of the data science component and thus a relatively simpler model would have sufficed for the business needs.

A good understanding of how the data science component fits into the overall end-to-end solution will undoubtedly help make the right design and implementation decisions.

This, in turn, increases customer acceptance of the solution within a reasonable operational budget.

I hope you liked the article.

Do post your comments and suggestions below.

I will be back with the last article of this series soon.
