Lessons from a real Machine Learning project, part 1: from Jupyter to Luigi

And what about notebooks? Two questions, the same answer.

Of course, notebooks were not eliminated from our development workflow.

They are still an awesome prototyping platform and an invaluable result-sharing tool, aren't they? We just started using them for the purposes they were created for in the first place.

Notebooks became personal, thus avoiding any git pain, and subject to a strict naming convention, author_incremental number_title.ipynb, to be easily searchable.

They remained the starting point of every analysis: models were prototyped using notebooks.

If one happened to outperform our state of the art, it was integrated into the production Python code.

Here, the concept of outperforming was well-defined: scoring procedures were implemented in utility modules and shared by all members of the team.
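As a minimal sketch of what such a shared module might look like (the module path, metric, and function names are illustrative, not our actual code):

```python
# src/models/scoring.py -- hypothetical sketch of a shared scoring module;
# the path, metric, and names are illustrative, not the project's real code.
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: one team-wide definition of the score."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def outperforms(candidate_score, baseline_score, tolerance=0.0):
    """A candidate wins only if it beats the baseline by more than tolerance.

    Lower is better for an error metric such as RMSE.
    """
    return candidate_score < baseline_score - tolerance
```

With every notebook importing the same functions, any two scores were always comparable.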

Notebooks also made up most of the documentation.

It took only a few days for us as a team to complete the transformation.

The difference was incredible.

Almost overnight, we unlocked the power of collective code ownership, unit testing, code reusability, and all the legacy of the last 20 years of Software Engineering.

A great boost in terms of productivity and responsiveness to new requests was the obvious consequence.

The most evident proof came when we realized that units of measure and labels were missing from all the goodness-of-fit charts.

As they were all implemented by a single function, it was fast and easy to fix them all.
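As a sketch of the idea (the module path, function name, and unit are hypothetical):

```python
# src/visualization/charts.py -- hypothetical sketch; the path, name,
# and unit of measure are illustrative.
import matplotlib.pyplot as plt

def goodness_of_fit_chart(y_true, y_pred, unit="kWh"):
    """Scatter observed vs. predicted values with labelled, unit-aware axes.

    Fixing a label or a unit here fixes it in every chart at once.
    """
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred, alpha=0.5)
    ax.set_xlabel(f"Observed [{unit}]")
    ax.set_ylabel(f"Predicted [{unit}]")
    ax.set_title("Goodness of fit")
    return fig
```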

What would have happened if the same charts had still been copied and pasted across lots of notebooks?

At the end of the transformation, our repository looked like this:

```
├── LICENSE
├── README.md          <- The top-level README for developers
│
├── data
│   ├── interim        <- Intermediate data
│   ├── output         <- Model results and scoring
│   ├── processed      <- The final data sets for modeling
│   └── raw            <- The original, immutable data dump
│
├── models             <- Trained and serialized models
│
├── notebooks          <- Jupyter notebooks
│
├── references         <- Data explanatory materials
│
├── reports            <- Generated analysis as HTML, PDF etc.
│   └── figures        <- Generated charts and figures for reporting
│
├── requirements.yml   <- Requirements file for conda environment
│
└── src                <- Source code for use in this project
    ├── tests          <- Automated tests to check source code
    ├── data           <- Source code to generate data
    ├── features       <- Source code to extract and create features
    ├── models         <- Source code to train and score models
    └── visualization  <- Source code to create visualizations
```

A win-win situation.

The end: the value of frameworks

We were satisfied with the separation of Jupyter prototypes and Python production code, but we knew we were still missing something.

Despite trying to apply all the principles of clean code, our end-to-end scripts for training and scoring became a little messy as more and more steps were added.

Once again, we figured out that there was a flaw in the way we were approaching the problem, and we looked for a better solution.

Once again, valuable resources came to the rescue:

- Norm Niemer, 4 Reasons why your Machine Learning code is probably bad, 2019
- Lorenzo Peppoloni, Data pipelines, Luigi, Airflow: everything you need to know, 2018

We studied Airflow, Luigi, and d6tflow, and we finally opted for a Luigi/d6tflow pipeline, with d6tflow used for simpler tasks and Luigi for more advanced use cases.

This time it took just a single day to implement the whole pipeline: we kept all our functions and classes, which encapsulated the logic for preprocessing, feature engineering, training, and scoring, and we replaced the scripts with pipelines.

The improvements in readability and flexibility were significant: when we had to change how the train and test sets were split, we only had to modify two tasks, preserving their input and output signatures, without worrying about anything else.
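As a minimal Luigi sketch of that idea (task names, file paths, and the naive split below are placeholders, not our real pipeline): as long as output() keeps the same signature, the split strategy can change inside a single task without touching anything downstream.

```python
# Minimal Luigi sketch -- task names, paths, and the naive split are
# placeholders, not the project's actual pipeline.
import pickle

import luigi

class MakeDataset(luigi.Task):
    """Upstream task producing the processed data set (details elided)."""

    def output(self):
        return luigi.LocalTarget("data/processed/dataset.pkl", format=luigi.format.Nop)

    def run(self):
        with self.output().open("wb") as f:
            pickle.dump(list(range(100)), f)  # placeholder data

class TrainTestSplit(luigi.Task):
    """The split strategy lives here and only here."""

    test_fraction = luigi.FloatParameter(default=0.2)

    def requires(self):
        return MakeDataset()

    def output(self):
        # Stable signature: downstream tasks depend on these two targets only.
        return {
            "train": luigi.LocalTarget("data/interim/train.pkl", format=luigi.format.Nop),
            "test": luigi.LocalTarget("data/interim/test.pkl", format=luigi.format.Nop),
        }

    def run(self):
        with self.input().open("rb") as f:
            data = pickle.load(f)
        cut = int(len(data) * (1 - self.test_fraction))
        train, test = data[:cut], data[cut:]  # swap the strategy here only
        with self.output()["train"].open("wb") as f:
            pickle.dump(train, f)
        with self.output()["test"].open("wb") as f:
            pickle.dump(test, f)

if __name__ == "__main__":
    luigi.build([TrainTestSplit()], local_scheduler=True)
```

Replacing the naive cut with, say, a time-based split means editing run() alone: the targets, and therefore every downstream task, stay untouched.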

Wrapping up

Wrapping up, we learned three important lessons about code in a Machine Learning project:

1. A Machine Learning project is a software project: we shall take care of the quality of our code. Having to deal with statistics and math is no excuse for writing bad code.
2. Jupyter Notebooks are great prototyping and sharing tools, but they are no replacement for a traditional code base made of modules, packages, and scripts.
3. The Directed Acyclic Graph (DAG) structure is great for Data Science and Machine Learning pipelines. There is no point in trying to create such a structure from scratch when there are very good frameworks to help.

Thank you, my reader, for getting here! This is my first article on Medium, and I am by no means a good writer.

If you have any comment, suggestion, or criticism, I beg you to share it with me.

Also, if you have any question or doubt about the topic of this post, please feel free to get in touch!

