Data Science Foundations: Know your data. Really, really, know it

, c characters in useragent strings clogging a csv pipeline

Some jerk managed to put into a thing

Field names that get repurposed for a completely different use because Eng didn’t want to run an ALTER due to downtime. (You know who you are, “logo” field)

Data dumps from gov’t COBOL code on a mainframe somewhere

Orphan IDs because no one really uses foreign key constraints in prod

All the Fun™ you can imagine and even more you can’t

Dealing with this layer

The utter insanity of this layer is why most data practitioners seem to spend the majority of their time cleaning and preparing data.

Between bugs introducing erroneous data, malicious/naive users giving you weird data, and occasional bad system design, there’s no end to the list.

We haven’t even considered what happens if you scrape the web, or have to extract data from PDF or “pretty-formatted Excel sheet”.

In fact, my tongue in cheek definition of a junior data scientist is someone who doesn’t violently recoil from the idea of putting an open text box on the internet.

Most of these data issues will likely crash any code you write to do analysis, which is usually a good thing.

If you’re lucky (unlucky?) enough that your code runs despite weird data input, it’ll invalidate any conclusions you make and you need to be on-point to realize what’s happened.

When you see weird data and aren’t sure what it means, ask other people.

It’s very often a bug that should be fixed.
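One cheap way to catch this kind of thing before it silently skews an analysis is to count the obviously broken rows up front. The snippet below is only a minimal sketch: the `orders` rows, the `user_id` field, and the `sanity_report` helper are all made-up names for illustration, not anything from a real system.

```python
from collections import Counter


def sanity_report(orders, known_user_ids):
    """Count rows with a missing user_id and rows whose user_id has no parent."""
    report = Counter()
    for row in orders:
        user_id = row.get("user_id")
        if user_id is None or user_id == "":
            report["missing_user_id"] += 1
        elif user_id not in known_user_ids:
            # The ID looks valid but points at a user that does not exist.
            report["orphan_user_id"] += 1
        else:
            report["ok"] += 1
    return dict(report)


# Hypothetical toy data: user 99 was never created, and one row has no ID.
users = {1, 2, 3}
orders = [
    {"order_id": "a", "user_id": 1},
    {"order_id": "b", "user_id": 99},
    {"order_id": "c", "user_id": None},
    {"order_id": "d", "user_id": 2},
]

report = sanity_report(orders, users)
print(report)
```

If the orphan or missing counts are anything other than what you expected, that is your cue to go ask someone before running the real analysis.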

#3 Know of business data quirks

Businesses often collect weird special cases along the way, and those special cases can trip you up even harder than a random NULL in your ID field.

They’re dangerous because they manifest as valid data points but behave massively differently.

Examples of these I’ve seen include:

Internal users, for testing, employee, or “friends of the business” use. They probably use things differently from everyone else

Strategic partners, maybe they have massively larger quotas and activity and are billed at a discount, or they get features early

Reseller accounts that effectively control 50 accounts worth of activity under one account

Calendars in general. National holidays will mess with your data, month lengths mess with your aggregations. I’ve got a burning dislike for Easter purely because it’s a different date every year and throws YoY comparisons for a loop twice a year

How to deal with this layer

Domain experts and partners all across the business are the key to dealing with this kind of data.

All these things are part of the institutional knowledge you need to tap into in order to make sense of the data you see.

The only other guard rail you have is being vigilant about the distribution of activity and users.

These special-case entries tend to stand out from a more typical customer in some way, so you can hunt them down as if they were a bug, then be corrected partway through.
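As a rough illustration of watching the distribution, here is a sketch that flags accounts whose activity is far above the median. The account names, the counts, and the cutoff of 20x the median are all assumptions picked for the example; a sensible threshold depends entirely on your own data.

```python
from statistics import median


def flag_heavy_accounts(events_per_account, multiple=20):
    """Return accounts whose event count exceeds `multiple` times the median."""
    typical = median(events_per_account.values())
    return sorted(
        account
        for account, count in events_per_account.items()
        if typical > 0 and count > multiple * typical
    )


# Hypothetical activity counts per account.
activity = {
    "acct_1": 12,
    "acct_2": 9,
    "acct_3": 15,
    "acct_4": 11,
    "reseller_x": 640,
    "qa_internal": 3,
}

suspects = flag_heavy_accounts(activity)
print(suspects)
```

Note that this only catches the heavy hitters. A quiet internal test account like the made-up `qa_internal` above sails right through, which is why the institutional knowledge still matters.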

#4 Know where the data comes from, how it’s generated and defined

In science, we’re supposed to meticulously document how data was collected and processed, because the details of that collection process matter.

Tons of research has been invalidated because of a flaw in how data was gathered and used.

In our case, technology implementation matters here a ton, so break out your Eng hats.

Do you depend on cookies? That means people can clear them or block them, and they can expire due to a short TTL.

People use multiple browsers and devices.

A simple example: a “unique cookie” isn’t the same as “a unique human user”. Mix those up and you’re in for a bad time.
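To make the gap concrete, here is a tiny sketch on a made-up event log. The `cookie_id` and `user_id` fields are hypothetical, and it assumes you have some login identifier to compare against, which many sites won’t have for most traffic.

```python
# Hypothetical event log: one row per event, user_id is None when not logged in.
events = [
    {"cookie_id": "c1", "user_id": "alice"},
    {"cookie_id": "c2", "user_id": "alice"},  # same person, second browser
    {"cookie_id": "c3", "user_id": "bob"},
    {"cookie_id": "c4", "user_id": None},     # never logged in
    {"cookie_id": "c1", "user_id": "alice"},
]

unique_cookies = {e["cookie_id"] for e in events}
known_users = {e["user_id"] for e in events if e["user_id"] is not None}

# Cookies never tied to a login: could be a new person, or someone already counted.
identified_cookies = {e["cookie_id"] for e in events if e["user_id"] is not None}
anonymous_cookies = unique_cookies - identified_cookies

print(len(unique_cookies), "unique cookies")
print(len(known_users), "known users")
print(len(anonymous_cookies), "cookies with no known user")
```

The cookie count and the user count disagree even in this toy log, and the anonymous cookie can’t be resolved either way from the data alone.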

Do you use front end JavaScript to send events like clicks and scrolls back to your systems? Does it work on all the browsers? Remember that people sometimes block JavaScript, and bots rarely run JS.

What is catching the events at home base? What machine records time? Do events fire right before or right after the API call we care about?

If things are being tracked in the database, when does the update happen? Is it all wrapped in a transaction? Do state flags change monotonically or freely back and forth? What’s the business logic dictating the state changes? Is it possible to get duplicate entries?

How does your A/B testing framework assign subjects? Is it really assigning variants randomly without bias? Are events being counted correctly?

Geospatial data? Have fun with the definitions of metro areas, handling zip and postal codes.

Queens County, NY is an aggregation of a bunch of smaller names btw.
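Going back to the A/B assignment question above: one partial check is whether the group sizes match the split you intended, sometimes called a sample ratio mismatch check. The sketch below assumes a two-variant test and uses a chi-square goodness-of-fit test with one degree of freedom. The function name and the counts are made up for illustration, and passing this check does not prove the assignment is unbiased.

```python
import math


def srm_p_value(n_control, n_treatment, expected_share_control=0.5):
    """P-value for observed group sizes versus the intended split (two groups)."""
    total = n_control + n_treatment
    expected_control = total * expected_share_control
    expected_treatment = total - expected_control
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for a test that was meant to split 50/50.
print(srm_p_value(5000, 5050))  # small imbalance
print(srm_p_value(5000, 5400))  # larger imbalance
```

A very small p-value suggests the split is not what the framework was configured to produce, which is worth chasing down before trusting any results from the experiment.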

IP data?
