Calling Bullshit in Data Analytics

Humans tend to assume that if event A precedes event B in time, then event A must be the cause of event B.

This fallacy is known as “post hoc ergo propter hoc” (after this, therefore because of this).

The following comic from xkcd highlights the absurdity of using this fallacy to imply direction of causality.

In the comic, Black Hat takes the fact that growth in cancer rates preceded growth in cell-phone usage as evidence that cancer causes cell phones (an absurd proposition, of course).

Source: xkcd

Tip # 8: An event preceding another in time does not establish it as the cause of the later event.

One of the most widely used models in predictive statistics is linear regression.

We fit a straight line based on the currently available data and then extrapolate it to make predictions.

Extrapolating far into the future, however, can be problematic if the prediction model does not take into account the real life constraints of the process being predicted.

As an example, consider the following paper published in Nature (a leading academic research journal).

The paper uses Olympic 100-metre sprint winning times for men and women to claim that women sprinters will soon overtake men.

Extrapolating the male (blue line) and female (red line) linear regressions, the authors predict that the women's winning time will fall below the men's in the year 2156.

Source: Nature

This approach is problematic for multiple reasons.

Firstly, the model uses about 100 years of data to make a prediction more than 150 years into the future.

We have no reason to believe that the trends of the past century will continue for the next century and a half.

Secondly, the model fails to take into account the limitations of the physical process being predicted.

It is close to impossible for a human to run a standard 100-metre sprint in less than 9 seconds.

Taking this into account, the declining trend in winning times should taper off rather than continue down to about 6 seconds, as shown in the graph.
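To see how fast an unconstrained line goes wrong, here is a minimal sketch: an ordinary least-squares fit extrapolated far beyond its data. The winning times below are hypothetical, illustration-only numbers, not the actual Olympic record used in the Nature paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical women's 100 m winning times (seconds), for illustration only.
years = [1928, 1948, 1968, 1988, 2008]
times = [12.20, 11.90, 11.08, 10.54, 10.78]

slope, intercept = fit_line(years, times)

# Extrapolate the fitted line ever further beyond the data.
for year in (2016, 2156, 2500):
    print(year, round(slope * year + intercept, 2))
```

Well before the year 2500 the unconstrained line predicts times that no physical process could produce, and eventually it crosses zero; a model that respects the roughly 9-second floor would instead flatten out.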

Source: xkcd

Tip # 9: Do not extrapolate beyond 10% of the range of data on which the model is fitted* (unless the model accounts for the underlying physical/economic process).

Big Data & Academic Research — Vanguards of Truth?

Over the past decade, the term big data has grown in popularity.

This has been accompanied by the idea that big data will overcome all the shortcomings of conventional data and statistics.

Further, combined with machine learning and artificial intelligence, it will allow us to predict things that couldn’t be predicted before.

But does big data really give us this perfect access to truth — current and future?

The example of Google Flu Trends (GFT) suggests otherwise.

GFT was created in 2008 to predict outbreaks of Influenza-Like Illness (ILI) using aggregated Google search queries.

In the US, the Centers for Disease Control and Prevention (CDC) publishes its own estimates of flu outbreaks, based on laboratory data, with a lag of about two weeks.

Initially, Google claimed GFT's predictions to be 97% accurate when compared with CDC estimates.

Post-2011, however, GFT started over-estimating flu prevalence.

In Feb 2013, Nature reported that in 2012–13, GFT estimates were more than double the CDC estimates.

The difficulty of selecting search queries that track flu itself, rather than seasonal confounders, meant that despite repeated tweaks the algorithm continued to over-predict.

GFT's performance was so poor that a simple model using lagged CDC data would have out-performed it in prediction accuracy.
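The "simple lagged model" can be sketched in a few lines: treat the most recent available CDC figure, which arrives about two weeks late, as the forecast for the current week. The weekly ILI rates below are hypothetical numbers for illustration, not real CDC data.

```python
# Hypothetical weekly ILI rates standing in for CDC estimates (illustration only).
cdc = [1.2, 1.5, 2.1, 3.0, 4.2, 5.0, 4.6, 3.8]
LAG = 2  # CDC estimates are published with roughly a 2-week lag

# Forecast for week t is simply the latest available figure, from week t - LAG.
lagged_pred = [cdc[t - LAG] for t in range(LAG, len(cdc))]
actual = cdc[LAG:]

# Mean absolute error of this naive baseline.
mae = sum(abs(p - a) for p, a in zip(lagged_pred, actual)) / len(actual)
print("lagged-model MAE:", round(mae, 2))
```

The point is not that this baseline is good, but that it sets a bar any big-data model must clear before its complexity is justified.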

Source: Science

Tip # 10: Big data is not a magic bullet for all complex problems.

It has its limitations.

Over centuries, academic research has pushed the boundaries of human knowledge.

It is no wonder, then, that research articles published in reputed peer-reviewed academic journals are taken as gospel.

Yet at the same time we come across articles that contradict each other’s findings.

Source: Public Library of Science

The issue lies with how empirical research is conducted and selected for publication.

When a hypothesis is tested against empirical data, the associated p-value indicates how likely the observed result would be by chance alone (the smaller the p-value, the higher the claimed significance).

Most journals require a p-value of 0.05 or less to publish a paper (though the threshold varies).

This has led to a situation where research methodology, variable selection and sample selection are manipulated to "prove" a biased hypothesis with a favourable p-value.

The academic community is cognisant of the issue as evidenced by the article quoted above and this piece in Nature.

The xkcd comic below as well as this hilarious one on sub-group analysis are not far from how biased manipulations are done in real life to get a favourable p-value to cross the publication barrier.

Source: xkcd

To see for yourself how such manipulations can affect the outcome of a study, check out this demo by FiveThirtyEight on how the p-value can be "hacked".

Just by varying which variables you choose to include in your model and which factors you choose to account for or ignore, you can have the result of your choice with a statistically significant p-value.
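The mechanics behind this are easy to reproduce. The sketch below generates an outcome and 20 candidate variables that are all pure noise; yet, because 20 hypotheses are tested at the 0.05 level, most runs find at least one "significant" predictor. The normal approximation to the correlation test is an assumption made here to keep the example dependency-free.

```python
import math
import random

random.seed(42)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def approx_two_sided_p(r, n):
    """Two-sided p-value for r under the null, via a normal approximation
    to the t statistic (adequate for n this large; illustration only)."""
    z = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return math.erfc(z / math.sqrt(2))

N_TRIALS, N_VARS, N_OBS = 500, 20, 100
false_hits = 0
for _ in range(N_TRIALS):
    outcome = [random.gauss(0, 1) for _ in range(N_OBS)]
    pvals = []
    for _ in range(N_VARS):
        predictor = [random.gauss(0, 1) for _ in range(N_OBS)]  # pure noise
        pvals.append(approx_two_sided_p(pearson_r(predictor, outcome), N_OBS))
    if min(pvals) < 0.05:  # "hacker" reports only the best variable
        false_hits += 1

rate = false_hits / N_TRIALS
print(f"Runs with at least one 'significant' noise variable: {rate:.0%}")
```

With 20 independent tests at the 0.05 level, chance alone delivers at least one false positive in roughly 1 − 0.95²⁰ ≈ 64% of runs, which is why testing many variables and reporting only the significant one is so misleading.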

Such a study may not be the truth, but it will be worthy of publication!

Tip # 11: Just because something got published in an academic journal doesn't mean that it is the absolute truth.

While those who intend to deceive may try to hide the truth under their bullshit, the truth can be hard to comprehend even when it is in plain sight.

Metaphorically, the truth is like the elephant in the poem below.

Those of us who are trying to figure out the truth, be it through data analysis or otherwise, are like the six blind men.

Each of us approaches the truth from a different angle and comes to a different conclusion about its nature based on our approach.

John Godfrey Saxe — The Blind Men and the Elephant

While each of us may think we have grasped the true nature of truth, its reality evades us all.

To quote the poem, “Though each was partly in the right, and all were in the wrong!”.

As someone tasked with uncovering the truth, one should not be too arrogant about their discoveries.

Insisting on our truth being the absolute truth will make us miss out on other equally valid perspectives.

Only once we combine these multiple perspectives will we be able to piece together the reality and come closer to finding truth.

CASE STUDY: Health Warnings & Tobacco Usage

Health warnings in the form of text or graphics were introduced to deter people from smoking cigarettes.

Big tobacco firms have a long history of resisting government policies that mandate these health warnings on cigarette packs.

An internal Philip Morris International (PMI) document, leaked by Reuters, lists “arguments and evidence” to make the claim that health warnings are not effective in reducing smoking rates.

Below are some of the claims made in the document along with data and charts given to support the claim.

Try to call out the bullshit in each of these using logical reasoning and some of the tips given above.

Claim: Oversized, shocking warnings do not reduce smoking rates

Claim: Many countries with smaller, text-only health warnings have seen greater rates of decline in smoker rates than countries with larger, pictorial health warnings

Claim: Countries with smaller, text-only warnings have a higher percentage of people thinking about quitting

Claim: Oversized, graphic health warnings are not necessary to reduce smoking rates

References and Further Reading

Harry Frankfurt, "On Bullshit"
Carl Bergstrom and Jevin West, "Calling Bullshit"
Fox News, "Food Stamp Program at All-Time High"
New York Times, "What is the Needle?"
Amazon, Nestle Hot Cocoa Mix
Math with Bad Drawings, "What Does Probability Mean in Your Profession"
The Conversation, "Music to die for: how genre affects popular musicians' life expectancy"
Tyler Vigen, "Spurious Correlations"
xkcd, "Cell Phones"; "Extrapolating"; "Significant"; "P-Values"
Nature, "Momentous sprint at the 2156 Olympics?"
Nature, "When Google got flu wrong"
Nature, "It's time to talk about ditching statistical significance"
Science, "The Parable of Google Flu: Traps in Big Data Analysis"
PLOS, "Why Most Published Research Findings Are False"
PLOS, "P values in display items are ubiquitous and almost invariably significant: A survey of top science journals"
FiveThirtyEight, "Hack Your Way To Scientific Glory"
John Godfrey Saxe, "The Blind Men and the Elephant"
Reuters, "Pakistan diluted proposed tobacco health warnings after Philip Morris, BAT lobbying"
Philip Morris International, "Excessive Health Warnings Toolkit"
Kennedy Elliot, "39 studies about human perception in 30 minutes"
