Coding habits for data scientists

By David Tan, ThoughtWorks.

As an ML practitioner, you know that code can get out of hand quickly.

What starts as an awesome ML model easily becomes a big blob of code that’s hard to understand.

As a consequence, modifying code becomes painful and error-prone, and it becomes increasingly difficult for ML practitioners to evolve their ML solutions.

This article shares some techniques for identifying bad habits that add to complexity in code as well as habits that can help us partition complexity.

It’s also now a video series.

If you’ve tried your hand at machine learning or data science, you know that code can get messy quickly.

Typically, code to train ML models is written in Jupyter notebooks, and it’s full of (i) side effects (e.g., print statements, pretty-printed dataframes, data visualisations) and (ii) glue code without any abstraction, modularisation, or automated tests.

While this may be fine for notebooks targeted at teaching people about the machine learning process, in real projects, it’s a recipe for an unmaintainable mess.

The lack of good coding habits makes code hard to understand, and consequently, modifying code becomes painful and error-prone.

This makes it increasingly difficult for data scientists and developers to evolve their ML solutions.

In this article, we’ll share techniques for identifying bad habits that add to the complexity in code as well as habits that can help us partition complexity.

  One of the most important techniques for managing software complexity is to design systems so that developers only need to face a small fraction of the overall complexity at any given time.

– John Ousterhout

To tackle complexity, we must first know what it looks like.

Something is complex when it’s composed of interconnected parts.

Every time we write code in a way that adds another moving part, we increase complexity and add one more thing to hold in our head.

While we cannot, and should not try to, escape from the essential complexity of a problem, we often add accidental complexity and unnecessary cognitive load through bad practices.

Complexity is unavoidable, but it can be compartmentalized.

In our homes, when we don’t actively organise and rationalise where, why, and how we place things, mess accumulates, and what should have been a simple task (e.g., finding a key) becomes unnecessarily time-consuming and frustrating.

The same applies to our codebase.

New code is constantly being added for data cleaning, feature engineering, bug fixes, handling new data, and so on.

Unless we vigilantly maintain our codebase and continuously refactor (and we can’t refactor without unit tests), mess and complexity are guaranteed.

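In the remainder of this article, we’ll share some common bad habits that increase complexity and better habits that help to manage complexity: keeping code clean, using functions to abstract away complexity, smuggling code out of Jupyter notebooks as soon as possible, applying test-driven development, and making small and frequent commits.

Keep code clean

Unclean code adds to complexity by making code difficult to understand and modify.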

As a consequence, changing code to respond to business needs becomes increasingly difficult and sometimes even impossible.

One such bad coding habit (or “code smell”) is dead code.

Dead code is code that is executed but whose result is never used in any other computation.

Dead code is yet another unrelated thing that we have to hold in our heads when coding.
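As a minimal sketch of what dead code can look like (the function and field names are hypothetical and are not the article's original samples), the first version below computes values that nothing ever uses, while the second keeps only the lines that contribute to the result:

```python
from statistics import mean


def average_fare_with_dead_code(passengers):
    """Bad: two of these lines run but never influence the returned value."""
    fares = [p["fare"] for p in passengers]
    ages = [p["age"] for p in passengers]  # dead: computed, never used
    sorted_fares = sorted(fares)           # dead: result is never read
    return mean(fares)


def average_fare(passengers):
    """Better: only the code that contributes to the result remains."""
    return mean(p["fare"] for p in passengers)


# Hypothetical sample data for illustration.
passengers = [
    {"age": 22, "fare": 10.0},
    {"age": 38, "fare": 30.0},
]
print(average_fare(passengers))
```

Both functions return the same value; the second simply gives the reader fewer unrelated things to track.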

Clean code practices have been written about extensively in several languages, including Python.

We’ve adapted these “clean code” principles, and you can find them in the clean-code-ml repo.

Use functions to abstract away complexity

Functions simplify our code by abstracting away complicated implementation details and replacing them with a simpler representation: the function’s name.

Imagine you’re in a restaurant.

You’re given a menu.

Instead of telling you the names of the dishes, this menu spells out the recipe for each dish.

For example, one such dish is:

Step 1. In a large pot, heat up the oil. Add carrots, onions and celery; stir until onion is soft. Add herbs and garlic and cook for a few more minutes.

Step 2. Add in lentils, add tomatoes and water. Bring soup to a boil and then reduce heat to let it simmer for 30 minutes. Add spinach and cook until spinach is soft. Finally, season with vinegar, salt and pepper.

It would have been easier for us if the menu hid all the steps in the recipe (i.e., the implementation details) and instead gave us the name of the dish (an interface, an abstraction of the dish).

(Answer: that was lentil soup).

To illustrate this point, here’s a code sample from a notebook in Kaggle’s Titanic competition before and after refactoring.

What did we gain by abstracting away the complexity into functions? When we refactor to functions, our entire notebook can be simplified and made more elegant, and our mental overhead is drastically reduced.

We’re no longer forced to process many lines of implementation details to understand the entire flow.

Instead, the abstractions (i.e., functions) hide the complexity and tell us what they do, saving us the mental effort of figuring out how they do it.
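The sketch below illustrates the idea on a tiny, made-up stand-in for the Titanic data. It is not the original Kaggle notebook; the records, field names, and function names are assumptions for illustration, and plain Python dictionaries are used instead of dataframes to keep it self-contained. The "before" part does everything inline; the "after" part gives each step a name:

```python
import copy
from statistics import median

# Hypothetical, tiny stand-in for the Titanic passenger data.
raw = [
    {"name": "Braund, Mr. Owen", "age": 22.0, "sex": "male"},
    {"name": "Cumings, Mrs. John", "age": 38.0, "sex": "female"},
    {"name": "Moran, Mr. James", "age": None, "sex": "male"},
]

# Before: implementation details sit inline, so the reader has to decode
# every line to work out what the cell is for.
inline = copy.deepcopy(raw)
known_ages = [p["age"] for p in inline if p["age"] is not None]
for p in inline:
    if p["age"] is None:
        p["age"] = median(known_ages)
    p["title"] = p["name"].split(",")[1].split(".")[0].strip()
    p["sex"] = 1 if p["sex"] == "female" else 0


# After: each step has a name that says what it does.
def fill_missing_ages(rows):
    """Replace missing ages with the median of the known ages."""
    fill = median(r["age"] for r in rows if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in rows]


def add_title(rows):
    """Extract the title (Mr, Mrs, ...) from 'Surname, Title. Given names'."""
    return [
        {**r, "title": r["name"].split(",")[1].split(".")[0].strip()}
        for r in rows
    ]


def encode_sex(rows):
    """Encode sex as 1 for female and 0 otherwise."""
    return [{**r, "sex": 1 if r["sex"] == "female" else 0} for r in rows]


def prepare(rows):
    """The whole flow, readable at a glance."""
    return encode_sex(add_title(fill_missing_ages(rows)))


prepared = prepare(raw)
```

Reading `prepare` is enough to understand the flow; the helper bodies only need to be opened when the details matter. The helpers also return new rows rather than mutating their input, which makes them easier to test.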

Smuggle code out of Jupyter notebooks as soon as possible

In interior design, there is a concept (the “Law of Flat Surfaces”) that states “any flat surface within a home or office tends to accumulate clutter.” Jupyter notebooks are the flat surface of the ML world.

Sure, Jupyter notebooks are great for quick prototyping.

But they’re where we tend to put many things: glue code, print statements, glorified print statements (df.describe() or df.plot()), unused import statements, and even stack traces.

Despite our best intentions, so long as the notebooks are there, mess tends to accumulate.

Notebooks are useful because they give us fast feedback, and that’s often what we want when we’re given a new dataset and a new problem.

However, the longer the notebooks become, the harder it is to get feedback on whether our changes are working.

In contrast, if we extract our code into functions and Python modules and have unit tests, the test runner will give us feedback on our changes in a matter of seconds, even when there are hundreds of functions.

Figure 1: The more code we have, the harder it is for notebooks to give us fast feedback on whether everything is working as expected.

Hence, our goal is to move code out of notebooks into Python modules and packages as early as possible.

That way, it can rest within the safe confines of unit tests and domain boundaries.

This will help to manage complexity by providing a structure for organizing code and tests logically and make it easier for us to evolve our ML solution.
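As a rough sketch of what extracted code and its tests might look like, the block below shows two small functions together with their unit tests. The function names and the file names in the comments are hypothetical; in a real project the functions and the tests would live in separate files, and they are combined here only to keep the example self-contained. The tests are plain functions with assertions, a style that test runners such as pytest can discover:

```python
# --- would live in a module, e.g. preprocessing.py (hypothetical name) ---
def normalise_column_names(columns):
    """Lower-case column names and replace spaces with underscores."""
    return [c.strip().lower().replace(" ", "_") for c in columns]


def drop_rows_with_missing(rows, required):
    """Keep only rows where every required field is present and not None."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


# --- would live in a test file, e.g. test_preprocessing.py (hypothetical) ---
def test_normalise_column_names():
    result = normalise_column_names([" Passenger Id", "Fare"])
    assert result == ["passenger_id", "fare"]


def test_drop_rows_with_missing_keeps_only_complete_rows():
    rows = [
        {"age": 22, "fare": 7.25},
        {"age": None, "fare": 8.05},  # age is missing
        {"fare": 9.0},                # age key is absent
    ]
    result = drop_rows_with_missing(rows, required=["age", "fare"])
    assert result == [{"age": 22, "fare": 7.25}]
```

Once code is in this shape, every change can be checked by re-running the tests rather than re-running the whole notebook from top to bottom.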

So, how do we move code out of Jupyter notebooks? Assuming you already have your code in a Jupyter notebook, you can follow the process shown in Figure 2.

Figure 2: How to refactor a Jupyter notebook.

The details of each step in this process (e.g., how to run tests in watch mode) can be found in the clean-code-ml repo.

Apply test-driven development

So far, we’ve talked about writing tests after the code is already written in the notebook.

This recommendation isn’t ideal, but it’s still far better than not having unit tests.

There is a myth that we cannot apply test-driven development (TDD) to machine learning projects.

To us, this is simply untrue.

In any machine learning project, most of the code is concerned with data transformations (e.g., data cleaning, feature engineering), and only a small part of the codebase is actual machine learning.

Such data transformations can be written as pure functions that return the same output for the same input, and as such, we can apply TDD and reap its benefits.

For example, TDD can help us break down big and complex data transformations into smaller bite-size problems that we can fit in our head, one at a time.
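One TDD cycle for a small data transformation might look like the sketch below. The function name, the labels, and the age thresholds (18 and 65) are illustrative assumptions. The tests are written first and describe the behaviour we want; the implementation follows and does just enough to make them pass:

```python
# Step 1 (red): write the tests first. They describe the behaviour we want
# from a function that does not exist yet, so they fail when first run.
def test_bins_age_into_life_stage():
    assert categorise_age(5) == "child"
    assert categorise_age(30) == "adult"
    assert categorise_age(70) == "senior"


def test_missing_age_is_labelled_unknown():
    assert categorise_age(None) == "unknown"


# Step 2 (green): write just enough code to make the tests pass.
# The thresholds below are illustrative assumptions.
def categorise_age(age):
    """Map an age in years to a coarse life-stage label (a pure function)."""
    if age is None:
        return "unknown"
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


# Step 3 (refactor): with the tests passing, the implementation can be
# restructured safely; the tests will report any change in behaviour.
```

Because `categorise_age` is a pure function, each test is a single input and a single expected output, which keeps every step of the cycle small.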

As for testing that the actual machine learning part of the code works as we expect it to, we can write functional tests to assert that the metrics of the model (e.g., accuracy, precision) are above our expected threshold.

In other words, these tests assert that the model functions according to our expectations (hence the name, functional test).
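A minimal sketch of such a functional test follows. The toy threshold "model", the data, and the 0.8 accuracy threshold are all assumptions made for illustration; in a real project the model would be whatever the training code produces (for example, a scikit-learn estimator), and the test data would come from a held-out set:

```python
def train_threshold_model(features, labels):
    """Toy 'model': pick the cut-off on one feature that best separates labels."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(features)):
        correct = sum((x >= cutoff) == bool(y) for x, y in zip(features, labels))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return lambda x: int(x >= best_cutoff)


def accuracy(model, features, labels):
    """Fraction of examples for which the prediction matches the label."""
    predictions = [model(x) for x in features]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def test_model_accuracy_is_above_threshold():
    # Hypothetical training and held-out data.
    train_x, train_y = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0], [0, 0, 0, 1, 1, 1]
    test_x, test_y = [0.5, 2.5, 5.0, 6.5, 9.0], [0, 0, 0, 1, 1]

    model = train_threshold_model(train_x, train_y)

    # The threshold is an assumption; choose one that reflects what the
    # project actually needs from the model.
    assert accuracy(model, test_x, test_y) >= 0.8
```

The test does not care how the model works internally. It only fails when a change to the data pipeline or training code pushes the metric below the agreed threshold.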

Make small and frequent commits

When we don’t make small and frequent commits, we increase mental overhead.

While we’re working on one problem, the changes for earlier ones are still shown as uncommitted.

This distracts us visually and subconsciously; it makes it harder for us to focus on the current problem.

For example, look at the first and second images below.

Can you find out which function we’re working on? Which image gave you an easier time?

When we make small and frequent commits, we avoid this overhead.

So, how small of a commit is small enough? Try to commit when there is a single group of logically related changes and passing tests.

One technique is to look out for the word “and” in our commit message, e.g., “Add exploratory data analysis and split sentences into tokens and refactor model training code.” Each of these three changes could be its own logical commit.

In this situation, you can use git add --patch to stage code in smaller batches to be committed.
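The “and” heuristic is simple enough to automate. The helper below is a hypothetical sketch, not part of git or of the clean-code-ml repo: it splits a commit message on the word “and” to show how many separate changes the message seems to describe.

```python
import re


def split_commit_message(message):
    """Split a commit message on the word 'and' to surface separate changes."""
    text = message.strip().rstrip(".")
    parts = re.split(r"\s+and\s+", text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


message = (
    "Add exploratory data analysis and split sentences into tokens "
    "and refactor model training code."
)
changes = split_commit_message(message)
if len(changes) > 1:
    print(f"This commit may bundle {len(changes)} changes:")
    for change in changes:
        print(f"  - {change}")
```

It is only a heuristic: a message such as “Rename train and test fixtures” describes a single change, so the output is a prompt to think, not a rule to enforce.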

  “I’m not a great programmer; I’m just a good programmer with great habits.”

– Kent Beck, pioneer of Extreme Programming and xUnit testing frameworks

These are habits that have helped us manage complexity in machine learning and data science projects.

We hope they help you become more agile and productive in your data projects as well.

Original.

Reposted with permission.

 Bio: David has been with ThoughtWorks for 2 years, and was working in the government sector in a non-technical role before he decided to embark on a career in software engineering.

 Over the last two years, he has worked on several machine learning side projects on tasks such as stock market price prediction, fraud protection, and beer quantity image recognition.

He is also a trainer for the ThoughtWorks JumpStart! program.

David is passionate about agile software development and knowledge sharing.

 During his free time he enjoys spending time with his family as a new dad.
