A stitch delayed — a modest fix for the biggest small problem in data science

(by which I mean if we just wrote down the two numbers — let’s say a mean of 1 kg and a spread of 200 g — you could draw a curve that looks acceptably similar to our data without ever having seen it)

So sensibly applying a normal distribution to appropriate data and recovering two numbers is a great idea.

If you can do it, do! But much of the time, as is the case with our x data, we cannot.

There is no way to describe asymmetry with these two parameters; we need a third.

We have lots of options, but I want to map out three — the good, the bad and the ugly.

The ugly (but actually perfectly good) method — mean, standard deviation and skew

What if we made a normal distribution out of wet sand, and tilted it? We’d get a skewed distribution, with two asymmetric tails.

If we tilt our sandpile to the left we get negative skew (more of the distribution above the mean), and to the right, positive skew.

However, we’re still describing our data via the mean and standard deviation, now just with an extra, esoteric parameter.

Plenty of readers may have furrowed brows about the choice of which skew is positive and negative, and more so when they’re told how to calculate it.

With these three numbers we almost certainly could reconstruct the distribution of x — but drawing it is a nightmare.

We have all the information we need, but just in an abstract form — there are no easy error bars and no obvious way to quote and plot the result.

It’s a perfectly good way of doing things, if you like that sort of thing, but it’s ugly and unintuitive (come at me tho).

The bad method — 16–50–84

I have not yet delivered on my promise of explaining these numbers (a plot synopsis of Lost*).

Let’s fix that.

*probably, I stopped watching

16% of the area under a normal distribution lies to the left of one standard deviation below the mean.

84% of the area under a normal distribution lies to the left of one standard deviation above the mean.

Thus if we have normally distributed data these two numbers, when compared to the mean, tell us the standard deviation.

Twice.
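If you want to sanity-check those two facts, here is a quick sketch in Python (using numpy and scipy; the 1 kg mean and 200 g spread are just the illustrative numbers from earlier):

```python
import numpy as np
from scipy import stats

# Where do the 16th and 84th percentiles fall for a standard normal?
print(stats.norm.ppf([0.16, 0.84]))  # roughly [-0.99, +0.99], i.e. about one sigma either side

# The same check on simulated data: mean 1 kg, spread 200 g
rng = np.random.default_rng(0)
weights = rng.normal(loc=1.0, scale=0.2, size=100_000)

p16, p50, p84 = np.percentile(weights, [16, 50, 84])
print(p50 - p16, p84 - p50)  # both come out close to 0.2 kg, the true spread
```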

And when the distribution is not perfectly normally distributed, these two numbers tell us… nothing?

There’s a fundamental problem here — 16 and 84 (and 50 for that matter) are three numbers relevant only to the strictly symmetric normal distribution — and yet we are trying to use them to describe a different, asymmetric distribution.

There’s a disconnect between what we’re measuring and how we’re measuring it.

Obviously there’s a relation between the two, but there are some non-linear steps we must take to go from one to the other.

This is like measuring the biomass of a forest in terms of the weight of the seeds.

There’s information there, but it’s abstracted away from being useful.

As a great demonstration of this, try to draw a distribution given the 16th, 50th and 84th percentile values.

If you’ve drawn anywhere between 0 curves and an infinite number you’ve missed the mark.

These three numbers do not sufficiently constrain a curve and do not relate to any model.

Let that sink in for a second.

Hello grainy old friend

The distribution of x in our experiment is a great example of this.

Those three dashed lines show the 16th, 50th and 84th percentiles.

Imagine taking away the distribution and just leaving those — then passing the plot on to someone else and asking them to explain the distribution.

Who would look at those three lines and think to put the peak waaay over to the left?

Who would guess at the length of that tail to the right?

Who would, if this was given as a result, be able to interpret the true form of the data?

At this point you may be thinking “yeah, that method’s terrible — so why would anyone do it?”

Or you may be thinking “Yeah, it’s not great, but it gets the job done — don’t rock the boat.”

The first response is one from anyone who believes in changing conventions for sensible reasons.

Sweden — shortly after the side of the road you drive on was legally changed

The second is someone who knows that changing conventions is costly and leads to a period of confusion, often not worth the gain from fixing a small problem.

The title of this article references the biggest small problem — and it is a small problem — likely to account for errors and misrepresentations of only a few percent.

There are obviously better methods, and many use them, but almost all have extra cost.

You could pass your full dataset — a perfect representation of the data, but a heavy one.

It takes up space, requires time and effort to find and reintegrate it, and may have nuances which the authors well understand but an inheritor may never know.

You could fit the best model you can think of to every dataset — which is great, but fitting a curve to data is not a perfect art, and reporting your results requires not just stating the numbers but the distribution used and the methods by which it was derived.

Both of these are completely legitimate and widely used methods — but they are time and thought intensive.

The reason the pragmatic convention of 16–50–84 has continued to exist is that it is easy, quick and understandable.

It’s simple to explain what you have done and how you have done it.

There’s nothing wrong with pragmatic short-cut conventions, in fact I think that they generally pay more dividends (in saved time and mental energy) than they cost in accuracy.

So what if I told you that there was a method equally simple (at least in terms of computational complexity) that will work on any dataset and gives results that well represent the form of a much wider array of data?

What if I tried to propose a new minimum standard of result reporting?

You’d probably think I was an overexcited monkey with a typewriter claiming to have written a new play.

credit — Amanda Cassingham-Bardwell (can I use this image, please? if not let me know :))

The good method — direct calculation of the split normal

The split normal is one of the easier answers in a pub-quiz round about obscure statistical distributions.

It is oft discovered, rediscovered and forgotten*.

It’s what you’d get if you took two different normal distributions (each with their own standard deviation), cut them in half and stitched them back together wrong.

*including by me, until I found a great document detailing all the other people who claimed discovery

The left-hand side has a standard deviation of 1, whilst the right-hand side deviation varies.

When they are both equal we just have a normal normal distribution (which is what I’ve taken to calling it to reduce confusion).

The reason it’s discovered is that it gives a useful, easy way to describe an asymmetric distribution.
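To make the stitching concrete, here is a minimal sketch of the density in Python. It uses the common two-piece-normal form, in which both halves share the peak height and are rescaled so the total area is one; the function name and example numbers are mine, and the parameterisation in the author’s own script may differ slightly.

```python
import numpy as np

def split_normal_pdf(x, mode, sigma_left, sigma_right):
    """Two-piece ("split") normal density: half a normal with sigma_left
    below the mode, half a normal with sigma_right above it, both sharing
    the same peak height and rescaled so the total area is one."""
    x = np.asarray(x, dtype=float)
    peak_height = 2.0 / (np.sqrt(2.0 * np.pi) * (sigma_left + sigma_right))
    sigma = np.where(x < mode, sigma_left, sigma_right)
    return peak_height * np.exp(-0.5 * ((x - mode) / sigma) ** 2)

# A peak at 0 with a tight left side and a long right tail
xs = np.linspace(-2, 6, 9)
print(split_normal_pdf(xs, mode=0.0, sigma_left=0.5, sigma_right=2.0))
```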

In three numbers it maps out the mode* of the data — its peak — and the rate at which the data falls off either side of this mode.

The three numbers are immediately tangible (unlike the skewness) and actually give meaningful results (unlike 16–50–84).

*remember when you were taught about that in school, applied it to a question about shoe sizes and never heard of it again

The reason it’s forgotten is that it’s only slightly useful — there are plenty of other distributions that fit similarly well, and some even have a few more enviable properties (whilst the normal normal and log normal distributions occur relatively commonly in nature, the split normal is inherently artificial, and whilst it may describe data well it will never be “perfect”).

However, I’m here to give you a BRAND NEW advantage that I think makes the split normal a viable replacement as the minimum pragmatic standard of result reporting — I found a way to make it directly calculable.

By directly calculable I mean that you can fit a split normal to your data using only a small number of basic mathematical operations (like the mean, which just requires addition, or the median, where you only need to sort the data), unlike methods where you have to fit the data by trying a bunch of different models and finding the best one (like a maximum likelihood estimation).

I’m going to leave a full description of how it’s done to a drier source — a paper I’ve written up on the subject (and have no idea where to publish — hence the less than glamorous current publication of choice — some file in my Google Drive — suggestions welcome!).

This means that finding the split normal parameters is as easy as, and much more representative than, the 16–50–84th percentile values.
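The exact recipe lives in that paper, so treat the following only as a rough illustration of what a direct calculation can look like: estimate the peak from a histogram, then take the one-sided root-mean-square spread about it on each side (for a true split normal, each side’s RMS deviation about the mode equals that side’s sigma). This is my sketch, not necessarily the paper’s method.

```python
import numpy as np

def fit_split_normal(data, bins=50):
    """Illustrative direct fit (not the paper's method): take the centre of
    the fullest histogram bin as the mode, then the one-sided RMS deviation
    about that mode on each side as sigma_left and sigma_right."""
    data = np.asarray(data, dtype=float)

    counts, edges = np.histogram(data, bins=bins)
    peak = np.argmax(counts)
    mode = 0.5 * (edges[peak] + edges[peak + 1])

    left, right = data[data < mode], data[data >= mode]
    sigma_left = np.sqrt(np.mean((left - mode) ** 2))
    sigma_right = np.sqrt(np.mean((right - mode) ** 2))
    return mode, sigma_left, sigma_right

# A split-normal-ish test sample: mode 0, sigmas 0.5 and 2.0
# (the two sides get samples in the ratio sigma_left : sigma_right)
rng = np.random.default_rng(1)
sample = np.concatenate([-np.abs(rng.normal(0, 0.5, 20_000)),
                          np.abs(rng.normal(0, 2.0, 80_000))])
print(fit_split_normal(sample))  # roughly (0, 0.5, 2.0)
```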

So what do we do about it? Well, my advice is to start using it 🙂 Follow that link and you’ll find a (very basic) python script* that can model it, fit it and randomly sample from it.

*if anyone reading knows how to get such a thing into numpy (or scipy), that would seem a natural place for it.
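Random sampling is also straightforward. One way (not necessarily what the linked script does) is to pick a side with probability proportional to that side’s sigma and then take a half-normal step; the helper below is just an illustrative sketch.

```python
import numpy as np

def sample_split_normal(mode, sigma_left, sigma_right, size, rng=None):
    """Draw from a split normal: pick the left side with probability
    sigma_left / (sigma_left + sigma_right), then take a half-normal
    step away from the mode with that side's sigma."""
    rng = np.random.default_rng() if rng is None else rng
    go_left = rng.random(size) < sigma_left / (sigma_left + sigma_right)
    sigma = np.where(go_left, sigma_left, sigma_right)
    sign = np.where(go_left, -1.0, 1.0)
    return mode + sign * np.abs(rng.normal(0.0, 1.0, size)) * sigma

draws = sample_split_normal(mode=1.0, sigma_left=0.1, sigma_right=0.3, size=100_000)
print(np.percentile(draws, [16, 50, 84]))  # note: the 50th percentile is not the mode
```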

This is all about lowering the bar for access to better tools, to simplify and streamline how people handle data rather than complicate it.

I don’t know if it will ever be taken up — or if a better standard still can be suggested — but I do hope this will help dissuade others from a 16–50–84 approach hereafter.

I would advise anyone interested to just start putting these values into their work instead of the 16–50–84 method.

Two competing standards just make things more complex, and this is about making the world a little bit simpler (and more accurate).

If you do this, but do want to signpost that you’ve treated your data a little differently, how about quoting your values like this:

Easy enough to read for anyone but subtly signposting that you’re using this convention.

(And when there’s something citable from my work you could add that too — but I’m not sure the academic world is ready to accept bibtex entries to Medium articles.)

Perhaps I have you convinced, perhaps not.

There is no one right way for everyone to do anything.

I was shocked at the number of people that I thought were doing it a wrong way — but that’s a very subjective view and even then I rarely believe it invalidates their work or conclusions — it’s just an unfortunate, common and hopefully fixable problem.

A stitch in time saves nine, as the old saying goes.

This stitch is coming a little late — maybe only saving three or four.

But, any small saving of time or accuracy, when played out over the worldwide population of people making, reading and using statistics will save lifetimes.

Play about with the concept and the tools.

Let me know if you think there are better ways still.

Importantly, don’t feel singled out if you are someone who has done this wrong in the past (embarrassingly I think I probably have) — this is a time to be progressive, not defensive.

If you want, you can find me on Twitter, where I tend to post arcane jokes about non-traversable networks, but sometimes about other science.

This will likely be an evolving document — I welcome anyone's input and don’t be surprised if you come back to find it changed (hopefully for the better).

The version you’re reading is the firstest of drafts…

