My Best Tips for Agile Data Science Research

Sometimes a machine learning model replaces a simple heuristic, and even 65% accuracy can be very valuable for the business.

We need to define what success is.

Always compare to a baseline model

What counts as good performance is a hard question. The answer depends heavily on how hard the problem is and on what the business needs.

My advice is to start modeling by building a simple baseline. It can be a simple machine learning model with basic features or even a business rule (a heuristic), such as predicting the average label within an important category.

This way we can measure our performance in comparison to the baseline and monitor our improvement in the task.
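To make the baseline idea concrete, here is a minimal sketch of the "average label in an important category" rule, with an error metric to compare later models against. The data, the `city` and `price` field names, and the helper names are all made up for illustration, and mean absolute error is just one reasonable metric.

```python
from collections import defaultdict
from statistics import mean


def fit_category_mean_baseline(rows, category_key, label_key):
    """Learn the average label per category, with the global mean as fallback."""
    labels_by_category = defaultdict(list)
    for row in rows:
        labels_by_category[row[category_key]].append(row[label_key])
    global_mean = mean(row[label_key] for row in rows)
    category_means = {cat: mean(vals) for cat, vals in labels_by_category.items()}

    def predict(row):
        # Unseen categories fall back to the global average label.
        return category_means.get(row[category_key], global_mean)

    return predict


def mean_absolute_error(predict, rows, label_key):
    return mean(abs(predict(row) - row[label_key]) for row in rows)


# Toy data, invented for this example.
train = [
    {"city": "A", "price": 100.0},
    {"city": "A", "price": 120.0},
    {"city": "B", "price": 200.0},
    {"city": "B", "price": 220.0},
]
test = [
    {"city": "A", "price": 110.0},
    {"city": "B", "price": 230.0},
    {"city": "C", "price": 150.0},  # category never seen in training
]

baseline = fit_category_mean_baseline(train, "city", "price")
baseline_mae = mean_absolute_error(baseline, test, "price")
print(f"baseline MAE: {baseline_mae}")
```

Any model you build afterwards can be scored with the same `mean_absolute_error` call, so every result is reported as "better or worse than the baseline" rather than as a number in isolation.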

Start with a simple model

Iterations are one of the core characteristics of agile development.

In a data science project, we don’t iterate on product features the way an engineering team does; we iterate on models.

Starting with a simple model with a small number of features and making it more and more complex iteratively has many advantages.

You can stop at any point when your model is good enough and save time and complexity.

You know exactly how each change affected model performance, which gives you intuition for your next experiments. Maybe most importantly, adding complexity iteratively makes it much easier and faster to debug your model and catch data leakage.
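One way to picture this loop is the sketch below. It adds one feature per iteration, records the validation score after each step, and flags any step with a suspiciously large jump. Everything here is an assumption for illustration: the tiny nearest-centroid classifier stands in for whatever model you use, the data is invented, and the `leak_jump` threshold of 0.3 is arbitrary. A big jump is only a hint to go and inspect the new feature, not proof of leakage.

```python
def nearest_centroid_accuracy(train, valid, features):
    """Fit a nearest-centroid classifier on `features` and score it on `valid`."""
    sums, counts = {}, {}
    for x, y in train:
        totals = sums.setdefault(y, [0.0] * len(features))
        for i, name in enumerate(features):
            totals[i] += x[name]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [t / counts[y] for t in totals] for y, totals in sums.items()}

    def predict(x):
        def squared_distance(label):
            return sum((x[name] - c) ** 2
                       for name, c in zip(features, centroids[label]))
        return min(centroids, key=squared_distance)

    return sum(predict(x) == y for x, y in valid) / len(valid)


def run_iterations(train, valid, feature_order, leak_jump=0.3):
    """Add one feature per iteration and log how the score changed."""
    history = []
    for k in range(1, len(feature_order) + 1):
        features = feature_order[:k]
        score = nearest_centroid_accuracy(train, valid, features)
        gain = score - history[-1]["score"] if history else None
        history.append({
            "features": features,
            "score": score,
            "gain": gain,
            # A very large jump from a single feature deserves a manual look.
            "suspicious": gain is not None and gain >= leak_jump,
        })
    return history


# Toy data: "f1" is weakly informative, "f2" is noise, "leaky" copies the label.
train = [
    ({"f1": 0.0, "f2": 1.0, "leaky": 0.0}, 0),
    ({"f1": 1.0, "f2": 0.0, "leaky": 0.0}, 0),
    ({"f1": 1.0, "f2": 1.0, "leaky": 1.0}, 1),
    ({"f1": 2.0, "f2": 0.0, "leaky": 1.0}, 1),
]
valid = [
    ({"f1": 0.0, "f2": 0.0, "leaky": 0.0}, 0),
    ({"f1": 2.0, "f2": 1.0, "leaky": 1.0}, 1),
    ({"f1": 1.2, "f2": 1.0, "leaky": 0.0}, 0),
    ({"f1": 0.8, "f2": 0.0, "leaky": 1.0}, 1),
]

history = run_iterations(train, valid, ["f1", "f2", "leaky"])
for step in history:
    print(step)
```

Because each step changes exactly one thing, the log tells you which feature was responsible for every gain, and which one to distrust.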

Plan sub-goals

Planning research projects is hard because they involve a great deal of uncertainty.

In my experience, it is best to plan your projects using sub-goals. For example, data exploration, data cleaning, dataset building, feature engineering, and modeling are small parts of the research that you can plan at least a few weeks ahead.

These sub-goals can bring value on their own without the final model.

For example, after data exploration, the data scientist can bring actionable insights to the business people, and dataset cleaning and building can immediately help other data scientists and analysts with their own projects.

Fail fast

Failing fast is maybe my most important point and probably the hardest to do.

At each iteration, you must ask yourself: what is the probability that the model’s performance will reach the minimum valuable KPI? I think that making the model more complex iteratively really helps here.

Adding more features and trying more models usually gives incremental improvements.

If your model performance is 70% and your minimum valuable KPI is 90%, you are probably not going to get there. In that case, stop the project and move on to the next problem, or change something drastic, such as changing your label or tagging much more data.

I am not saying that you shouldn’t try to solve very hard problems; just make sure that you are not wasting time on methods that probably won’t reach your project goals.
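A rough way to put a number on that question is to project your recent gains forward. The sketch below is only a back-of-the-envelope aid and not a real probability estimate. The function name, the window size, and the example scores are my own assumptions, and the linear projection is optimistic because gains usually shrink over time.

```python
import math


def iterations_to_target(scores, target, window=3):
    """Naive projection: iterations still needed at the recent average gain.

    Returns 0 if the target is already met, and None if recent iterations
    show no improvement (the target is unreachable at the current pace).
    """
    if scores[-1] >= target:
        return 0
    recent = scores[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    average_gain = sum(gains) / len(gains) if gains else 0.0
    if average_gain <= 0:
        return None
    return math.ceil((target - scores[-1]) / average_gain)


# Scores in percentage points after each iteration (invented numbers).
scores = [62, 66, 69, 70]
needed = iterations_to_target(scores, target=90)
print(f"iterations still needed at the current pace: {needed}")
```

If the answer is far beyond the number of iterations you can afford, or `None`, that is the signal to stop or to change something drastic instead of adding one more feature.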

Move to production ASAP

My last piece of advice is to deploy your model to production at the earliest point at which it is valuable, or shortly after.

I know that your final model may have totally different features and that a lot of the work will be wasted.

But first, your model already delivers value, so why wait? Second, and more importantly, production often has its own constraints: some features are not available in the production systems, some features come in different formats, or your model may be too slow or use too much RAM.

Solving these problems early can save a lot of time that would otherwise go into modeling that is unrealistic for production.
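Some of these constraints can be checked before any deployment work starts. The sketch below compares the features a model was trained on with what production can actually serve, and times a single prediction. The schemas, the format names, and both helper functions are hypothetical. Memory usage is not measured here and would need its own check.

```python
import statistics
import time


def check_feature_contract(training_schema, production_schema):
    """List features that production lacks or serves in a different format."""
    missing = sorted(
        name for name in training_schema if name not in production_schema
    )
    format_mismatch = sorted(
        name for name, fmt in training_schema.items()
        if name in production_schema and production_schema[name] != fmt
    )
    return {"missing": missing, "format_mismatch": format_mismatch}


def median_latency_ms(predict, sample, repeats=50):
    """Median wall-clock time of one prediction, in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Hypothetical schemas: feature name -> format.
training_schema = {"age": "int", "signup_date": "datetime", "lifetime_value": "float"}
production_schema = {"age": "int", "signup_date": "str"}

report = check_feature_contract(training_schema, production_schema)
print(report)

# Stand-in for a real model's predict function.
latency = median_latency_ms(lambda row: row["age"] * 2, {"age": 30})
print(f"median latency: {latency:.4f} ms")
```

Running a check like this at the start of each iteration keeps you from investing in features or models that production could never serve.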

Hope you enjoyed my post, and you’re more than welcome to read and follow my blog.
