6 uncommon principles for effective data science

Choosing how to clean dirty data or which features to use will affect a model's accuracy, but the biggest factors that determine whether a machine learning model can be deployed to production come from the questions you ask at the start of the project.

Early decisions have a disproportionately large impact on projects

This point is easy to understand on the surface, yet few people consider its implications. To build a model with an actionable outcome, the following has to happen before a single line of code is typed: communicating with end users, understanding the business rules already in place, looking at existing data and deciding how to define the "ground truth" the model will be trained on so that it is meaningful for end users, understanding the limitations of the proposed model, and stress-testing the model's predicted output with business users.

Here's an example to illustrate this point.

Building a churn prediction model

You are approached by the marketing department to build a machine learning model to predict which customers have churned. The marketing department tells you that such an output is useful because it lets them re-target customers and improve retention. As a data scientist, this seems like a simple, straightforward task: build a binary classification model whose output is 1 for a churned customer and 0 for an active customer. In fact, you tell your colleagues you can one-up their request by giving them the probability that a customer churns.

Further discussions take place, and you learn that marketing defines churned customers as customers who have not ordered in a year or more. You decide to follow their existing logic and label your model's targets accordingly.

You present your plan, and the marketing department gives you the go-ahead to build the model. You spend weeks pulling and cleaning data, defining your loss function, determining your cross-validation strategy, training and tuning
your model, validating your results, and end up with an AUC of 0.9. You are elated with the results and proudly present your findings to the marketing team, explaining what data and model were used and how to interpret AUC.

The marketing department has comments.

First, subscribers who recently signed up but have not yet made orders are labelled by the model as churned. This should not be the case, because their order cycle is longer. More broadly, users who signed up at different times should have different definitions of churn: a frequent customer who has not bought in six months may be considered churned, while a new user who has not bought in six months may not be, because the normal buying cycle is a year or more.

Second, the marketing department realizes that the customers the model predicts will churn are already targeted by existing retention efforts: a simple business rule already gives the maximum possible discount to customers who have not purchased in six months.
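The labeling problem marketing raises can be sketched in a few lines. This is a minimal, hypothetical illustration (the column names, dates, and the one-year tenure cutoff are assumptions, not from the article): a naive "no order in the past year" rule versus a tenure-aware rule that refuses to label customers who have not been around for a full buying cycle.

```python
# Hypothetical sketch: naive vs. tenure-aware churn labels.
# Customer fields and the 365-day cycle are illustrative assumptions.
from datetime import date

TODAY = date(2024, 1, 1)  # reference date for label computation

customers = [
    # (customer_id, signup_date, last_order_date)
    ("A", date(2021, 1, 1), date(2022, 10, 1)),  # long-tenured, ~15 months idle
    ("B", date(2023, 9, 1), None),               # new signup, no order yet
    ("C", date(2022, 1, 1), date(2023, 11, 1)),  # recently active
]

def naive_churn(last_order):
    """Marketing's original rule: churned (1) if no order in the past year."""
    if last_order is None:
        return 1
    return int((TODAY - last_order).days >= 365)

def tenure_aware_churn(signup, last_order, min_tenure_days=365):
    """Only assign a churn label once the customer has been around for at
    least one full buying cycle; newer signups stay unlabeled (None)."""
    if (TODAY - signup).days < min_tenure_days:
        return None  # too new to judge; exclude from training data
    return naive_churn(last_order)

for cid, signup, last_order in customers:
    print(cid, naive_churn(last_order), tenure_aware_churn(signup, last_order))
```

Under the naive rule, new signup B is labelled churned on day one; the tenure-aware rule leaves B out of the training set entirely, which is one simple way to encode the "different signup times need different churn definitions" feedback.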
