Statistics is the Grammar of Data Science — Part 3/5

Statistics refresher to kick start your Data Science journey

Semi Koen

Feb 2

This is the 3rd article of the ‘Statistics is the Grammar of Data Science’ series, covering Measures of Location (percentiles and quartiles) and Moments.

Revision

Bookmarks to the rest of the articles for easy access:

Article Series
Part 1: Data Types | Measures of Central Tendency | Measures of Variability
Part 2: Data Distributions
Part 3: Measures of Location | Moments
Part 4: Covariance | Correlation
Part 5: Conditional Probability | Bayes’ Theorem

Measures of Location

Percentiles

Percentiles divide ordered data into hundredths.

In a sorted dataset, a given percentile is the value below which that percentage of the data falls.

The 50th percentile is pretty much the median.
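As a rough sketch of the idea, the snippet below computes percentiles with Python's standard `statistics` module. The `weights_kg` values and the `percentile` helper are made up for illustration, and the "inclusive" interpolation method is only one of several conventions, so other tools may return slightly different numbers.

```python
import statistics

# Hypothetical sample (not taken from any real chart): weights in kg.
weights_kg = [7.9, 8.4, 8.6, 8.9, 9.1, 9.3, 9.5, 9.8, 10.2, 10.6, 11.5]


def percentile(data, p):
    """Return the p-th percentile (1 <= p <= 99) using linear interpolation."""
    # quantiles(n=100) returns the 99 cut points that split the data into hundredths.
    cut_points = statistics.quantiles(data, n=100, method="inclusive")
    return cut_points[p - 1]


p50 = percentile(weights_kg, 50)
p90 = percentile(weights_kg, 90)

print("50th percentile:", p50)
print("median:         ", statistics.median(weights_kg))
print("90th percentile:", p90)
```

With this method the 50th percentile and the median agree up to floating-point rounding.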

For instance, imagine the growth chart of baby girls from birth until 2 years old.

By following the lines, we can see that 98% of one-year-old baby girls weigh less than 11.5 kg.

Girls’ growth chart. Courtesy: World Health Organisation Child Growth Standards

Another popular example is a country’s income distribution.

The 99th percentile is the income below which 99% of the country earns, while the remaining 1% earns more.

In the case of the UK on the graph below, this is £75,000.

UK income distribution. Courtesy: Wikipedia

Quartiles

Quartiles are special percentiles, which divide the data into quarters.

The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile.

The median is called both the second quartile, Q2, and the 50th percentile.

Interquartile Range (IQR)

The IQR is a number that indicates how spread out the middle half (i.e. the middle 50%) of the dataset is, and it can help determine outliers.

It is the difference between Q3 and Q1:

IQR = Q3 – Q1

IQR. Courtesy: Wikipedia

Generally speaking, outliers are those data points that fall outside the range from Q1 – 1.5 x IQR to Q3 + 1.5 x IQR.
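The 1.5 x IQR rule can be sketched in a few lines of Python. The `iqr_outliers` helper and the `sample` values are hypothetical, and the quartiles use the "inclusive" interpolation method from the standard `statistics` module; other quartile conventions can shift the fences slightly.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return (q1, q3, iqr, outliers) using the k x IQR rule."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Anything strictly outside the two fences is flagged as an outlier.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return q1, q3, iqr, outliers


# Hypothetical sample with one suspiciously large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 40]
q1, q3, iqr, outliers = iqr_outliers(sample)

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("outliers:", outliers)
```

The multiplier `k` is left as a parameter because 1.5 is a convention rather than a law; a larger value flags fewer points.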

Box Plots

Box plots (also called box and whisker plots) illustrate how concentrated the data is and how far the extreme values are from most of the data.

Elements of a boxplot. Courtesy: Wikimedia

A box plot consists of a scaled horizontal or vertical axis and a rectangular box.

The minimum and maximum values are the endpoints of the axis (-15 and 5 in this case).

Q1 marks one end of the blue box and Q3 the other end.

The ‘whiskers’ (shown in purple) extend from the ends of the box to the smallest and largest data values.

There are also box plots that have dots marking outlier values (shown in red).

In those cases, the whiskers do not extend to the minimum and maximum values; they stop at the most extreme values that are not flagged as outliers.
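To make the elements concrete, here is a minimal sketch that computes the numbers such a box plot would draw, without doing any plotting. The `box_plot_stats` helper and the `values` list are hypothetical, and it assumes the common convention where whiskers stop at the last data point inside the 1.5 x IQR fences; plotting libraries may use other defaults.

```python
import statistics


def box_plot_stats(data, k=1.5):
    """Return the values a box plot with outlier dots would draw."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,                    # one end of the box
        "median": median,            # line inside the box
        "q3": q3,                    # other end of the box
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": [x for x in ordered if x < lower_fence or x > upper_fence],
    }


# Hypothetical data on an axis running from -15 to 5.
values = [-15, -4, -3, -2, -2, -1, 0, 0, 1, 2, 5]
stats = box_plot_stats(values)

for name, value in stats.items():
    print(name, value)
```

In this made-up sample the minimum (-15) is drawn as an outlier dot, so the lower whisker stops short of it, while the upper whisker still reaches the maximum.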

✏️ Boxplots on a Normal Distribution

There is a subtle nuance when reading a boxplot next to a normal distribution: quartile 1 (Q1) and quartile 3 (Q3) do not sit one standard deviation away from the mean. They sit at about ±0.6745σ, so the box still holds 50% of the data, with 25% on each side of the median. The familiar 34.135% (one side of the mean) and 68.27% (both sides) belong to the wider ±1σ band, not to the box.
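A quick way to check these percentages is with `statistics.NormalDist` from Python's standard library. This sketch assumes a standard normal (mean 0, standard deviation 1) and compares the area between Q1 and Q3 with the area within one standard deviation of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of a standard normal, in units of the standard deviation.
q1 = standard_normal.inv_cdf(0.25)
q3 = standard_normal.inv_cdf(0.75)

# Share of the data inside the box (between Q1 and Q3).
inside_box = standard_normal.cdf(q3) - standard_normal.cdf(q1)

# Share of the data within one standard deviation of the mean.
within_one_sigma = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print("Q1, Q3 in sigmas:", round(q1, 4), round(q3, 4))
print("area between Q1 and Q3:", round(inside_box, 4))
print("area within one sigma: ", round(within_one_sigma, 4))
```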

Comparison of a boxplot of a nearly normal distribution (top) and a PDF for a normal distribution (bottom). Courtesy: Wikipedia

Moments

Moments describe various aspects of the nature and shape of our distribution.

#1 — The first moment is the mean of the data, which describes the location of the distribution.

#2 — The second moment is the variance, which describes the spread of the distribution.

A distribution with a high variance is more spread out than one with a low variance.

#3 — The third moment is the skewness and it is basically a measure of how lopsided a distribution is.

A positive skew means we have a left lean and a long right tail.

This means that the mean is to the right of the bulk of our data.

And vice versa for a negative skew.

Skewness. Courtesy: Wikipedia

#4 — The fourth moment is the kurtosis, which describes how thick the tail is and how sharp the peak is.

It indicates how likely it is to find extreme values in our data.

Higher values make outliers more likely.

This sounds a lot like spread (variance) but is subtly different.

Kurtosis illustration of three curves.

Courtesy: Wikipedia

We can see that the curves with higher peaks have a higher kurtosis value, i.e. the topmost curve has a higher kurtosis than the bottommost curve.
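The four moments can be computed by hand with the standard `statistics` module, as in the sketch below. It assumes the population definitions: the mean is the first raw moment, the variance is the second central moment, and skewness and kurtosis are the third and fourth standardized moments. The `standardized_moment` helper and the `sample` values are hypothetical, and statistical libraries often apply bias corrections or report "excess" kurtosis (kurtosis minus 3, the value for a normal distribution), so their numbers can differ.

```python
import statistics


def standardized_moment(data, order):
    """Population standardized moment: mean of ((x - mean) / std) ** order."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return statistics.fmean(((x - mean) / std) ** order for x in data)


# Hypothetical right-skewed sample: most values are small, one is large.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]

mean = statistics.fmean(sample)             # 1st moment: location
variance = statistics.pvariance(sample)     # 2nd moment: spread
skewness = standardized_moment(sample, 3)   # 3rd moment: lopsidedness
kurtosis = standardized_moment(sample, 4)   # 4th moment: tail weight

print("mean:    ", mean)
print("median:  ", statistics.median(sample))
print("variance:", variance)
print("skewness:", skewness)
print("kurtosis:", kurtosis)
```

The long right tail in this sample pulls the mean above the median and gives a positive skewness, and the single large value pushes the kurtosis above the normal distribution's value of 3.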

That’s all folks!

This was a rather short article; we learnt how important percentiles are, as they indicate where we stand in relation to everyone else.

Then we saw a special category, called quartiles, and their application in finding outliers.

Finally, we explored the four ‘moments’ which describe a curve’s shape.

Thanks for reading! Part 4 is coming soon…

I regularly write about Technology & Data on Medium. If you would like to read my future posts, please ‘Follow’ me!
