Simpson’s paradox highlights one of my favourite things about data: the need for good intuition about the real world, and the fact that most data is a finite-dimensional representation of a much larger, much more complex domain. [...]

Simpson’s paradox showcases the importance of skepticism and of interpreting data with respect to the real world. It also shows the danger of oversimplifying a more complex truth by trying to see the whole story from a single data-viewpoint.

The paradox is relatively simple to state, and is often a cause of confusion and misinformation for audiences without statistical training:

Simpson’s Paradox: A trend or result that is present when data is put into groups, but that reverses or disappears when the data is combined.

One of the most famous examples of Simpson’s paradox is UC Berkeley’s suspected gender bias. [...]

What happens if we split our data up by sex?

This suggests that 84.4% of men and 40% of women liked ‘Sinful Strawberry’, whereas 85.7% of men and 50% of women liked ‘Passionate Peach’. [...]

This is an example of Simpson’s Paradox! Our intuition tells us that a flavour preferred both by men and by women should also be preferred when a person’s sex is unknown, and it is pretty strange to find out that this is not true. This is the heart of the paradox.

Lurking variables

Simpson’s paradox arises when there are hidden variables that split data into multiple separate distributions. [...]

It is important to know whether we’re looking at poorly sampled data or at a real case of the paradox. [...]

If we wanted to know about a hospital’s survival rate, we should probably split up our data and look at categorised groups of people who arrive at the hospital with different illnesses. [...]

In every situation, the key is to interpret the data in relation to the underlying domain, and to take the most appropriate data-viewpoint.

That’s a wrap, thanks for reading! If you enjoyed this post on Simpson’s paradox and interpreting data through data-viewpoints, feel free to get in touch with me (Tom Grigg) regarding any thoughts or queries!
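
As a postscript, the flavour example can be checked with a few lines of arithmetic. The sketch below is only an illustration: the excerpt above quotes percentages, not head counts, so the counts in `counts` are hypothetical numbers chosen to reproduce those percentages (84.4%, 40%, 85.7%, 50%), and the names `counts`, `group_rate` and `combined_rate` are my own. The point is that a flavour can win within each sex and still lose overall when the groups are very different in size.

```python
# Hypothetical (liked, asked) counts per flavour and sex, chosen only so that
# the per-group rates match the percentages quoted in the post.
counts = {
    "Sinful Strawberry": {"men": (760, 900), "women": (40, 100)},
    "Passionate Peach": {"men": (600, 700), "women": (150, 300)},
}


def group_rate(flavour, group):
    """Share of one group that liked the flavour."""
    liked, asked = counts[flavour][group]
    return liked / asked


def combined_rate(flavour):
    """Share of everyone asked who liked the flavour, ignoring sex."""
    liked = sum(l for l, _ in counts[flavour].values())
    asked = sum(a for _, a in counts[flavour].values())
    return liked / asked


for flavour in counts:
    per_group = ", ".join(
        f"{group}: {group_rate(flavour, group):.1%}" for group in counts[flavour]
    )
    print(f"{flavour} -> {per_group}, combined: {combined_rate(flavour):.1%}")
```

Under these assumed counts, ‘Passionate Peach’ has the higher rate among men and among women, yet ‘Sinful Strawberry’ has the higher combined rate, because most of its tasters come from the group that tends to like both flavours more. Here sex plays the role of the lurking variable.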