Best Practices for Using Notebooks for Data Science

By Armin Wasicek, Sumo Logic

In the data science world, notebooks have emerged as an important tool: they are active documents created by individuals or groups to write and run code, display results, and share outcomes and insights. Like every other story, a data science notebook follows a certain structure that is typical for its genre. In essence, a notebook should record an explanation of why experiments were initiated and how they were performed, and then display the results. In order to explain notebooks, let's take a step back to understand their anatomy, discuss human speed versus machine speed, explore how notebooks can increase productivity, and outline the top five best practices for writing notebooks.

A notebook segments a computation into individual steps called paragraphs. Paragraphs need not contain computations; they can also contain text or visualizations that illustrate the workings of the code.

The power of the notebook lies in its ability to segment and then slow down computation. Certain paragraphs are dedicated to making progress in the computation, i.e., advancing the state, whereas other paragraphs simply read out and display the state. So when developing a notebook, the user builds up state and then iterates on that state until progress is made. A notebook will most likely keep all its state in working memory, whereas a stand-alone program needs to rebuild the state every time it is run. Rebuilding takes more time, and the required I/O operations might fail.

It is better to write two or more notebooks than to overload a single notebook.

A common source of confusion is program state that gets passed between paragraphs through hidden variables. Avoid referencing variables from paragraphs other than the immediately preceding one.

A notebook integrates code; it is not a tool for code development. The tool for code development is an Integrated Development Environment (IDE).
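
The advice about hidden variables can be illustrated with a small sketch. The Python example below mimics three notebook paragraphs as plain functions. The names `load_readings`, `clean_readings`, and `summarize`, as well as the sample data, are hypothetical and do not come from the original article. Each step receives the state it needs as an argument and returns the state it produces, so nothing is passed along implicitly.

```python
from statistics import mean

# Hypothetical sample data standing in for a slow or fragile load step.
RAW = ["12.5", "13.5", "", "11.0", "n/a", "13.0"]


def load_readings():
    """Paragraph 1: build up state (here, a copy of the raw strings)."""
    return list(RAW)


def clean_readings(raw):
    """Paragraph 2: advance the state by keeping only numeric values."""
    cleaned = []
    for item in raw:
        try:
            cleaned.append(float(item))
        except ValueError:
            # Skip entries such as "" or "n/a" that are not numbers.
            continue
    return cleaned


def summarize(values):
    """Paragraph 3: read out the state without modifying it."""
    return {"count": len(values), "mean": mean(values), "max": max(values)}


# Each "paragraph" depends only on the output of the one before it.
raw = load_readings()
clean = clean_readings(raw)
summary = summarize(clean)
print(summary)
```

Because each step depends only on its arguments, `summarize(clean)` can be edited and re-run without reloading or re-cleaning the data, which is the build-up-state-then-iterate pattern described above.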
