Data Science Practice 101: Always Leave An Analysis Paper Trail

It’s often best practice NOT to check large datasets into your git repo, so there’s always a disconnect between checked-in code and the resulting data artefacts.

This isn’t a question of whether there is a record “somewhere out there”, but whether the record is readily accessible when it is needed: at any time and place a question is brought up, within seconds.

Cut to the chase, what are some example solutions?

Here’s a non-exhaustive list of ways to keep things together.

Use what feels most natural for all stakeholders.

Excel files: Make a tab for your raw data dump, a tab for the query, a tab for the analysis.

Put links/references if something is too big to fit.

CSV files: You’ll want to compress your data for sending/archival anyway, so tar/zip/bz2/xz it up with your query .sql file, any processing code, etc.

Slide decks: Depending on audience and forum, either appendix slides w/ links to analysis calculations/documentation, or docs placed in speaker’s notes.

Dashboards: Tricky. Links on the UI if feasible, or comments/links hidden within the code that generates the specific dashboard elements.

Email reports: Provide a link to a more detailed source, or a reference to the relevant data.

Jupyter/Colab notebooks: Documentation should be woven into the code and the notebook itself; those text/HTML blocks are there for a reason.

Production models: Code comments and/or links pointing back to the original analysis the model stands on, or at least the analysis that generated any parameters.
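As a rough illustration of the CSV suggestion above, here is a minimal Python sketch that packs a data dump into one compressed archive together with the query and code that produced it. The function name `bundle_analysis` and the file names in the usage note are hypothetical; only the standard-library `tarfile` module is assumed.

```python
import tarfile
from pathlib import Path


def bundle_analysis(archive_path, files):
    """Pack a data dump together with the query and code that produced it."""
    archive_path = Path(archive_path)
    # "w:xz" writes an xz-compressed tar; "w:gz" or "w:bz2" work the same way.
    with tarfile.open(archive_path, "w:xz") as archive:
        for file in map(Path, files):
            # Store each file under its bare name so the archive unpacks flat.
            archive.add(file, arcname=file.name)
    return archive_path
```

Something like `bundle_analysis("churn.tar.xz", ["churn.csv", "query.sql", "clean.py"])` would then give you one file to send or archive, with the paper trail inside it.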

Anything else I should do?

Date your files: Most analyses, especially ad-hoc ones, have context that is rooted in time, such as quarterly board meetings, the release of a new feature, etc.

Stuff from 2017 is usually less relevant in 2019.

When all context is lost but someone can produce an email of an announcement from around the time the deliverable was sent out, you have a date to go searching for.
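One way to make dating a habit is to let a tiny helper do it. The sketch below is only an assumption about what a naming scheme could look like (an ISO date prefix); `dated_name` is a hypothetical helper, not a convention prescribed by this post.

```python
from datetime import date


def dated_name(name, on=None):
    """Prefix a file name with an ISO date (YYYY-MM-DD)."""
    # ISO 8601 prefixes sort chronologically in a plain file listing.
    stamp = (on or date.today()).isoformat()
    return f"{stamp}_{name}"
```

For example, `dated_name("retention_analysis.xlsx", date(2019, 3, 1))` returns `"2019-03-01_retention_analysis.xlsx"`; with no date given it uses today's date.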

(Image: my personal habit of dating analysis and deliverable files.)

Make queries that give the same result regardless of when they are run.

Very often it’s tempting to do queries that just “pull everything” or “last 7 days”, but the one flaw they have is that the data changes depending on when you run the query, even 10 minutes later.

This makes it impossible to reproduce the results of a query without modifying it, which is probably undesirable.

In some situations, it makes a lot of sense to make queries with dynamic time windows, and in others it doesn’t.

Be conscious of your potential future use case (will people ask you to re-run it with updated data, etc.) while making your decision.
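To make the difference concrete, here is a small sketch using Python's built-in `sqlite3` module. The `events` table, its columns, and the sample rows are all made up for illustration; the point is only the contrast between a relative window and explicit date bounds.

```python
import sqlite3

# Hypothetical table and data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2019-02-27"), (2, "2019-03-02"), (3, "2019-03-06"), (4, "2019-03-09")],
)

# Relative window: the answer depends on the day the query is run.
RELATIVE_QUERY = """
    SELECT COUNT(*) FROM events
    WHERE created_at >= date('now', '-7 days')
"""

# Fixed window: explicit half-open bounds [start, end). Re-running it gives
# the same answer, as long as rows for that period are not changed later.
FIXED_QUERY = """
    SELECT COUNT(*) FROM events
    WHERE created_at >= ? AND created_at < ?
"""


def count_events(start, end):
    """Count events inside an explicit, reproducible date window."""
    return conn.execute(FIXED_QUERY, (start, end)).fetchone()[0]
```

Here `count_events("2019-03-01", "2019-03-08")` returns 2 no matter when it is run, while the relative query quietly drifts. If someone asks for updated numbers, you change the two dates, and the change itself becomes part of the record.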

It’s a chaotic world out there.

Try to stay organized in your own tiny domain.

