Statistics is the Grammar of Data Science — Part 1

Machine Learning libraries like TensorFlow or scikit-learn hide almost all the complex mathematics away from the user.

That means we don’t need to be experts in maths, but we do need a basic understanding of the fundamental principles; it will help us utilise these libraries better.

I am starting a series of 5 short articles that will cover the following topics to kick-start, and later accompany, our Data Science journey:

Part 1: Data Types | Measures of Central Tendency | Measures of Variability
Part 2: Data Distributions
Part 3: Measures of Location | Moments
Part 4: Covariance | Correlation
Part 5: Conditional Probability | Bayes’ Theorem

Let’s start with part 1️⃣…

Data Types

We cannot go more basic than this: Data is split into three categories, based on which a Data Scientist chooses how to further analyse and process it:

#1. Numerical data represents some quantifiable information that is measurable and is further divided into two subcategories:

Discrete data, which is integer based (e.g. number of people)
Continuous data, which is decimal based (e.g. price, distance, temperature).

#2. Categorical data is qualitative data that is used to classify data into categories (think of an enumeration in programming).

For example, gender, car brands, country of residence etc.

Sometimes we can assign numbers to the categories so they are more compact, but they don’t have any mathematical meaning.

#3. Ordinal data represents discrete and ordered units, e.g. champions league rank (1st, 2nd, 3rd), bug priority (low, critical or showstopper), or hotel rating (1–5*).
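Since categorical data was compared to an enumeration, here is a minimal Python sketch of how the three data types might look in code. The class names, members and sample values (CarBrand, BugPriority, 42, 19.99) are made up for illustration; only the standard enum module is assumed.

```python
from enum import Enum, IntEnum

# Numerical data: plain numbers that support arithmetic.
number_of_people = 42   # discrete (integer based), illustrative value
price = 19.99           # continuous (decimal based), illustrative value


# Categorical data: labels only. The numbers are arbitrary identifiers
# and carry no mathematical meaning (hypothetical example).
class CarBrand(Enum):
    AUDI = 1
    FORD = 2
    TOYOTA = 3


# Ordinal data: discrete categories with a meaningful order.
# IntEnum members can be compared, so the order is preserved.
class BugPriority(IntEnum):
    LOW = 1
    CRITICAL = 2
    SHOWSTOPPER = 3


is_more_urgent = BugPriority.SHOWSTOPPER > BugPriority.LOW
by_priority = sorted(
    [BugPriority.CRITICAL, BugPriority.SHOWSTOPPER, BugPriority.LOW]
)
print(is_more_urgent, [p.name for p in by_priority])
```

A plain Enum deliberately refuses ordering comparisons such as CarBrand.AUDI < CarBrand.FORD, which mirrors the point that the numbers assigned to categories are not meant for arithmetic or ranking.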

Measures of Central Tendency

Let’s assume we have a dataset of 5 numbers:

{ 6, 3, 100, 3, 13 }

Mean

The mean (represented by the Greek letter mu, μ) is the average of a dataset.

To calculate the mean, we sum up all the values and divide the total by the number of values.

6 + 3 + 100 + 3 + 13 = 125 → μ = 125 ÷ 5 = 25

Median

The median is the middle of a dataset.

To calculate the median, we sort all the values (in ascending or descending order) and take the one that is in the middle.



3, 3, 6, 13, 100 → 6

If there is an even number of data points, then we calculate the mean of the two that fall in the middle.

The median is less susceptible to outliers than the mean, so we need to consider what the data distribution looks like when choosing which one to use.
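To see the mean and median side by side, here is a small sketch using Python's standard statistics module on the same five numbers. The variable names are only illustrative, and dropping the value 100 is just one way to show how differently the two measures react to an outlier.

```python
from statistics import mean, median

data = [6, 3, 100, 3, 13]

mu = mean(data)        # (6 + 3 + 100 + 3 + 13) / 5
middle = median(data)  # middle value of the sorted data: 3, 3, 6, 13, 100

# Remove the outlier (100). Four values remain, so the median becomes
# the mean of the two middle values: (3 + 6) / 2.
without_outlier = [x for x in data if x != 100]
mean_without = mean(without_outlier)
median_without = median(without_outlier)

print("with outlier:   ", mu, middle)
print("without outlier:", mean_without, median_without)
```

The mean drops from 25 to 6.25 once the outlier is gone, while the median only moves from 6 to 4.5.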

Mode

The mode is the most common value in the dataset.

To calculate the mode, we locate the number that occurs most frequently.

3:2, 6:1, 13:1, 100:1 → 3

Mode is usually only relevant to discrete numerical data, not to continuous data.
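The frequency count above can be reproduced with a short sketch based on collections.Counter and the standard statistics module. The second dataset, [1, 1, 2, 2, 3], is an invented example to show what happens when two values tie.

```python
from collections import Counter
from statistics import mode, multimode

data = [6, 3, 100, 3, 13]

# Count how many times each value occurs (value -> occurrences).
counts = Counter(data)
most_common_value, occurrences = counts.most_common(1)[0]

# statistics.mode returns a single most common value;
# multimode returns every value tied for first place.
single_mode = mode(data)
tied_modes = multimode([1, 1, 2, 2, 3])

print(dict(counts))
print(most_common_value, occurrences, single_mode, tied_modes)
```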

Measures of Variability

Range

Range is the difference between the lowest and the highest number of a dataset.

To calculate the range, we subtract the minimum from the maximum value.

100 – 3 = 97

It shows us how varied the dataset is, i.e. how spread it is, but, like the mean, it is really sensitive to outliers.
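As a quick illustration of that sensitivity, the sketch below computes the range with and without the value 100. Removing that single value is only a demonstration device, not a recommended cleaning step.

```python
data = [6, 3, 100, 3, 13]

# Range: maximum minus minimum.
value_range = max(data) - min(data)  # 100 - 3

# The same calculation with the outlier removed.
trimmed = [x for x in data if x != 100]
trimmed_range = max(trimmed) - min(trimmed)  # 13 - 3

print(value_range, trimmed_range)
```

One extreme value accounts for almost the entire range here: 97 with it, 10 without it.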

Variance

Variance measures how spread out the data is.

To calculate the variance, we take the average of the squared differences from the mean.

#1. Find the mean of the data points
From the previous section it is 25

#2. Subtract the mean from each data point
6 – 25 = -19
3 – 25 = -22
100 – 25 = 75
3 – 25 = -22
13 – 25 = -12

#3. Square each result
(-19)^2 = 361
(-22)^2 = 484
(75)^2 = 5,625
(-22)^2 = 484
(-12)^2 = 144

#4. Find the mean of the results (i.e. sum up and divide by n)
361 + 484 + 5,625 + 484 + 144 = 7,098 → 7,098 ÷ 5 = 1,419.6

On the 3rd step the reason we use the square of the difference is twofold:

negative differences have the same impact as positive differences, i.e. they won’t cancel each other out
it amplifies the effect the outliers have in the dataset.
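The four steps translate almost line by line into plain Python. This is a minimal sketch with illustrative variable names; it divides by n, so it treats the five numbers as a full population.

```python
data = [6, 3, 100, 3, 13]
n = len(data)

# Step 1: find the mean of the data points.
mu = sum(data) / n

# Step 2: subtract the mean from each data point.
deviations = [x - mu for x in data]

# Step 3: square each result.
squared = [d ** 2 for d in deviations]

# Step 4: find the mean of the squared results.
variance = sum(squared) / n

# Without squaring, the differences cancel each other out.
sum_of_deviations = sum(deviations)

print(deviations, squared, variance, sum_of_deviations)
```

The raw differences add up to zero, which is why an average of unsquared differences would tell us nothing about spread.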

✏️ Data Completeness

There is a subtle distinction for Step #4 depending on how complete our dataset is:

For a full population, we divide by the number of data points (n), i.e. Step #4 was correct, as in our case we had a full population
For samples, we divide by the number of data points minus 1 (n – 1)
7,098 ÷ 4 = 1,774.5

Standard Deviation

Standard deviation (represented by the Greek letter sigma, σ) is just the square root of the variance.

σ = SQRT(1,419.6) = 37.68

It is used to judge which data point is an outlier, in terms of how many standard deviations it is away from the mean.

In our case, taking one standard deviation from the mean as the cut-off, the value 100 is an outlier:

μ = 25
σ = 37.68
Outliers (upper): 25 + 37.68 = 62.68
Outliers (lower): 25 – 37.68 = -12.68

So values greater than 62.68 or lower than -12.68 are considered outliers.
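The population and sample variance, the standard deviation and the outlier bounds can all be checked with Python's standard statistics module. In this sketch the cut-off multiplier k is a hypothetical parameter: it is set to 1 to match the calculation above, although cut-offs of 2 or 3 standard deviations are also commonly used.

```python
from statistics import mean, pstdev, pvariance, variance

data = [6, 3, 100, 3, 13]

population_variance = pvariance(data)  # divides by n
sample_variance = variance(data)       # divides by n - 1

mu = mean(data)
sigma = pstdev(data)  # square root of the population variance

# k = 1 follows the worked example; it is an assumption, not a fixed rule.
k = 1
upper = mu + k * sigma
lower = mu - k * sigma
outliers = [x for x in data if x > upper or x < lower]

print(population_variance, sample_variance)
print(round(sigma, 2), round(upper, 2), round(lower, 2), outliers)
```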

Thanks for reading! Part 2 is coming soon…

I regularly write about Technology & Data on Medium. If you would like to read my future posts then please ‘Follow’ me!
