It’s OK to use spreadsheets in data science


” — EveryoneMost practitioners I’ve spoken to agree that using a spreadsheet in exploratory/prototyping way is totally fine to do (they might not do it themselves, but they don’t see much wrong with it).

You’re typically messing around with a small sample of data to figure out what is possible.

There’s little expectation that you’ll be producing refined code or analysis at this point.

Depending on your familiarity with Excel, Google Sheets, or Libreoffice vs pandas and R, you might even be able to work faster.

I’m really obsessed with being familiar with my data sets before I build out an analysis or write code.

Sure, you can look at descriptive statistics, but they can lie to you, a lot:Such an awesome animationThe only real way to get a good sense of the data is to look at distributions, visualizations, and directly sampling it in raw form.

Spreadsheets are generally great for this.

I tend to find it less clunky than using pandas to poke around at arbitrary chunks of rows.

The key feature is how quickly you can experiment with data transformations in a spreadsheet.

A quick VLOOKUP will let you test a simple join between two messy tables to find the broken or duplicate keys.

You can toss data into a pivot table in a few keystrokes and clicks and check out certain aggregations.

A lot of bugs can be spotted this way because you’ll attempt a simple bit of arithmetic, it breaks because of wonky data, and you can quickly examine what actually happened and add it to your list of data errors to clean out.

Similar operations in pandas, while also quite easy, takes more typing and keeping details straight in your head as to what functions have been applied to what, etc.

So the ideal case here would be to use a spreadsheet to explore, prototype out what data transformations you want to apply, check some distributions, then go and code up a proper model in Python or R now that everything’s clearer in your head.

The trick to know when to stop is if you’re seriously considering writing a macro or something, stop.

Dealing with ridiculous semi-repetitive codingTracking systems are FULL of arbitrary combinations of arbitrary strings and numbers that you need to query against to count.

Maybe there’s an underlying pattern that works for your use case that let’s you meta-program something, or maybe not.

This is especially an issue for systems that have multiple people over the years adding/changing various tracking codes with different naming conventions, even different back-end stacks.

An example of a typical tracking system that’s had a few years to grow production cruftMany times, there’s no other way to deal with data sets like the above other than writing some kind of brittle hard-coded mapping function of some kind.

It’s honestly a challenge to keep everything consistent and documented over years of production, the mix of camelCase and underscores points to that.

Doing a simple aggregation for meaningful analysis is an utter pain in the butt.

— ProductLead: "Where do my registrations come from?"– Analyst: "Why do you hate me?"select case when event_string like 'navRegButton' then page_path when event_string like 'homepage_register_button' then 'homepage' when event_string like 'newsletterRegistration' then 'newsletter'when event_string like 'adCampaign_%' then 'ad'when event_string like 'homepageBiggerRegButton' then 'homepage'end as source,count(*) as reg_startsfrom ui_eventsgroup by sourceComing up with that giant case statement like the above is a pain, keeping it organized is even worse.

I call it semi-repetitive because if it were actually repetitive, you’d be able to write a function that would capture its behavior fairly easily.

This one has a bunch of stupid custom edge cases because of tech debt, so it has undesirable Kolmogorov complexity .

Enter using spreadsheets to keep things somewhat sane.

Sometimes you gotta break some eggs to make a registration source trackerIs this ideal?.Probably not.

But I’m not even sure what the ideal actually looks like in this situation.

The alternative would be to hard-code all this into Python for analysis which is just as brittle at this mess.

A cleaner solution is to somehow refactor how your tracking codes are set up past, present, and future and create a system that prevents this from happening, but I’m not smart enough to envision such a system that won’t be abused by some future use.

Sharing packaged data with less technical folkThere’s a reason why lots of BI tools of all levels have a kind of “export to CSV/Excel” feature.

Lots of very smart analytic people don’t know much about coding in Python or R.

They’re not going to look at a notebook of code and go “Aha!.Let me make some tweaks to check my own hypothesis.

” But, if you hand them a table of fairly cleaned up data, they’ll handle themselves quite well.

So why not have just a CSV, the universal data transfer format?.You can, but it makes leaving a data source trail more work.

You can package all the relevant information needed to pull a data set into a tab in the spreadsheet, whether it’s relevant queries, links to scripts, whatever.

Always leave a data trailThe reason for this is to make sure that when the inevitable “where did this come from?” question comes from a 3rd party, there’s actually an answer to it, without having to do a ton of crazy time-wasting detective work.

Mapping when you’re a map newbieMaps are insanely hard to make, even if you have exact Lat/Lon coordinates to work with.

Mapping zip codes is even more insane to make (things like FIPS to Zip conversion is a non-trivial pain).

Most examples I’ve seen involve a significant amount of packages to work and that’s assuming you just want a static image.

On the bright side, we don’t have to boot up GIS software to do this sort of thing.

Thankfully, and rather amazingly, for those us who can’t tell the difference between Mercator from Mercedes, spreadsheets in recent years have started to fill this niche of “I need a quick map viz right now”, Excel being at the forefront of this as of this writing in 2019.

Google Sheets also has some mapping features but it’s not as powerful still.

Excel 3d Mapping feature: It’s like MAGICGoogle Sheets Maps: Not as magical, but still pretty magicMaking these maps is a high technical feat, you have the crazy map outlines to deal with, there’s usually AI/ML involved in interpreting whether a piece of data is a zip code or some other location, map locations often have duplicated names, data points need to be aggregated depending on the display zoom level… So having all of that work abstracted for you is a huge amount of value.

Conclusion — spreadsheets make good throwawaysTo the extent that spreadsheets need to be maintained, they represent a form of technical debt.

The trick is to not let it be technical debt, by avoiding the use of spreadsheets as long-term living documents.

While we’re often trained to not generate throwaway work, paradoxically, spreadsheets are almost ideal for fast throwaway work.

It’s often when we keep them around that they become an eldritch horror.


. More details

Leave a Reply