Probability Distributions Every Data Scientist Should Know

Now that you know what a probability distribution is, let’s learn about some of the most common ones!

Bernoulli Probability Distribution

A random variable with a Bernoulli distribution is among the simplest ones.

It represents a binary event: “this happened” vs “this didn’t happen”, and takes a value p as its only parameter, which represents the probability that the event will occur.

A random variable B with a Bernoulli distribution with parameter p will have the following probability mass function:

P(B = 1) = p, P(B = 0) = 1 - p

Here B = 1 means the event happened, and B = 0 means it didn’t.

Notice how both probabilities add up to 1, and therefore no other value for B will be possible.
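
Here’s a minimal sketch of how you could sample a Bernoulli variable in Python. The function name sample_bernoulli and the value p = 0.3 are just picked for illustration, and the seed is fixed only so the run is repeatable.

```python
import random

def sample_bernoulli(p, rng=random):
    """Return 1 with probability p and 0 with probability 1 - p."""
    # rng.random() is uniform on [0, 1), so it falls below p with probability p.
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # fixed seed, only for repeatability
draws = [sample_bernoulli(0.3, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # fraction of 1s, should land close to p
```

The fraction of 1s in a large sample is an estimate of p, which is handy when all you have is the data.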

Uniform Probability Distribution

There are two kinds of uniform random variables: discrete and continuous ones.

A discrete uniform distribution will take a (finite) set of values S, and assign a probability of 1/n to each of them, where n is the number of elements in S.

This way, if for instance my variable Y was uniform in {1,2,3}, then there’d be a 1/3 (roughly 33%) chance that each of those values came out.

A very typical case of a discrete uniform random variable is found in dice, where your typical die has the set of values {1,2,3,4,5,6}.
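
As a quick sketch (roll_die is a made-up helper name), you can simulate a fair die with Python’s random module and check that every face comes out about 1/6 of the time:

```python
import random
from collections import Counter

def roll_die(rng=random):
    """Sample a discrete uniform variable on {1, 2, 3, 4, 5, 6}."""
    return rng.randint(1, 6)  # randint includes both endpoints

rng = random.Random(1)  # fixed seed, only for repeatability
rolls = [roll_die(rng) for _ in range(60_000)]
frequencies = {face: count / len(rolls) for face, count in sorted(Counter(rolls).items())}
print(frequencies)  # each face should be close to 1/6
```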

A continuous uniform distribution, instead, only takes two values a and b as parameters, and assigns the same density to each value in the interval between them.

That means the probability of Y taking a value in a sub-interval (from c to d) equals the ratio of its length to the length of the whole interval (b - a).

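Therefore if Y is uniformly distributed between a and b, and a ≤ c ≤ d ≤ b, then

P(c ≤ Y ≤ d) = (d - c) / (b - a)

This way, if Y is a uniform random variable between 1 and 2, then for example P(1 ≤ Y ≤ 1.5) = (1.5 - 1) / (2 - 1) = 0.5.

Python’s random module has a random() function that samples a uniformly distributed continuous variable between 0 and 1.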
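
To make the interval idea concrete, here is a small sketch that compares the length ratio (d - c) / (b - a) with what you get from actually sampling. The helper name uniform_interval_probability and the numbers a, b, c, d are assumptions chosen for the example.

```python
import random

def uniform_interval_probability(a, b, c, d):
    """P(c <= Y <= d) for Y uniform on [a, b], assuming a <= c <= d <= b."""
    return (d - c) / (b - a)

rng = random.Random(2)  # fixed seed, only for repeatability
a, b, c, d = 1.0, 2.0, 1.2, 1.5
samples = [rng.uniform(a, b) for _ in range(100_000)]
empirical = sum(c <= y <= d for y in samples) / len(samples)

# The formula and the simulated fraction should roughly agree.
print(uniform_interval_probability(a, b, c, d), empirical)
```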

Interestingly, it can be shown that any other distribution can be sampled given a uniform random number generator and some calculus.
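
One common way to do this is inverse transform sampling: draw a uniform value u and push it through the inverse of the target distribution’s cumulative distribution function. The sketch below does it for the exponential distribution (covered further down in this article), whose CDF is easy to invert. The function name and the rate of 2.0 are illustrative assumptions.

```python
import math
import random

def sample_exponential(lam, rng=random):
    """Draw from an exponential distribution with rate lam via inverse transform."""
    u = rng.random()  # uniform on [0, 1)
    # The exponential CDF is F(x) = 1 - exp(-lam * x). Solving F(x) = u for x
    # gives x = -ln(1 - u) / lam. Since u < 1, the log argument stays positive.
    return -math.log(1.0 - u) / lam

rng = random.Random(3)  # fixed seed, only for repeatability
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 1 / lam = 0.5
```

In practice you would probably just call random.expovariate, but the sketch shows where “some calculus” comes in.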

Normal Probability Distribution

Normal distributions. Source: Wikipedia.

Normally distributed variables are so commonly found in nature, they’re actually the norm.

That’s actually where the name comes from.

If you round up all your workmates and measure their heights, or weigh them all and plot a histogram with the results, odds are it’s gonna approach a normal distribution.

I actually saw this effect when I showed you Exploratory Data Analysis examples.

It can also be shown that if you take a large enough sample of almost any random variable (it needs a finite variance) and average those measures, and repeat that process many times, those averages will be approximately normally distributed.

That fact’s so important, it has its own name: the Central Limit Theorem.
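
You can watch this happen with a few lines of Python. This is only a sketch: sample_means is a hypothetical helper, and the sample size of 50 and the 5,000 repetitions are arbitrary choices. It averages draws from a uniform variable, which is not bell-shaped at all, and checks how many averages fall within one standard deviation of their mean.

```python
import random
import statistics

def sample_means(draw, sample_size, repetitions):
    """Average `sample_size` draws, and repeat that `repetitions` times."""
    return [
        statistics.fmean(draw() for _ in range(sample_size))
        for _ in range(repetitions)
    ]

rng = random.Random(4)  # fixed seed, only for repeatability
# Uniform(0, 1) is flat rather than bell-shaped: mean 0.5, variance 1/12.
means = sample_means(rng.random, sample_size=50, repetitions=5_000)

mu = statistics.fmean(means)
sigma = statistics.stdev(means)
within_one_sigma = sum(abs(m - mu) <= sigma for m in means) / len(means)
# For a normal variable, about 68% of values fall within one sigma of the mean.
print(mu, sigma, within_one_sigma)
```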

Normally distributed variables:

Are symmetrical, centered around a mean (usually called μ).

Can take any real value, but only deviate more than two standard deviations (sigmas) from the mean about 5% of the time.

Are literally everywhere.

Most often if you measure any empirical data and it’s symmetrical, assuming it’s normal will kinda work.

For example, rolling K dice and adding up the results will distribute pretty much normally once K is reasonably large.
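
Here’s a rough sketch of the dice example, with K = 10 picked arbitrarily and roll_sum being a made-up helper. It also checks the two-sigma rule of thumb: if the sums are close to normal, roughly 5% of them should land more than two standard deviations away from the mean.

```python
import random
import statistics

def roll_sum(k, rng=random):
    """Roll k fair six-sided dice and add up the results."""
    return sum(rng.randint(1, 6) for _ in range(k))

rng = random.Random(5)  # fixed seed, only for repeatability
k = 10
totals = [roll_sum(k, rng) for _ in range(50_000)]

mu = statistics.fmean(totals)      # theory: 3.5 * k
sigma = statistics.pstdev(totals)  # theory: sqrt(k * 35 / 12)
outside_two_sigma = sum(abs(t - mu) > 2 * sigma for t in totals) / len(totals)
print(mu, sigma, outside_two_sigma)  # the last value should be roughly 0.05
```

Since dice sums are whole numbers, the match with the 5% figure is only approximate.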

Log-Normal Probability Distribution

Lognormal distribution. Source: Wikipedia.

The log-normal probability distribution is the Normal Probability Distribution’s smaller, less frequently seen sister.

A variable X is said to be log-normally distributed if the variable Y = log(X) follows a normal distribution.

When plotted in a histogram, log-normal probability distributions are asymmetrical, and become even more so if their standard deviation is bigger.
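
A small sketch of that definition in Python: generate a normal Y, exponentiate it to get a log-normal X, and then take the log to get back to something symmetric. The parameters mu = 0 and sigma = 1 are assumptions chosen only for illustration.

```python
import math
import random
import statistics

rng = random.Random(6)  # fixed seed, only for repeatability
mu, sigma = 0.0, 1.0    # parameters of Y = log(X), chosen only for illustration

# If Y is normal, then X = exp(Y) is log-normal.
x = [math.exp(rng.gauss(mu, sigma)) for _ in range(100_000)]
log_x = [math.log(v) for v in x]

# X is right-skewed, so its mean sits well above its median.
print(statistics.fmean(x), statistics.median(x))
# log(X) is symmetric again, so its mean and median nearly coincide.
print(statistics.fmean(log_x), statistics.median(log_x))
```

Comparing the mean with the median like this is a cheap first check for skew before you reach for a histogram.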

I believe lognormal distributions to be worth mentioning, because most money-based variables behave this way.

If you look at the probability distributions of any variable that relates to money, like:

Amount sent on the latest transfer of a certain bank.

Volume of the latest transaction on Wall Street.

A set of companies’ quarterly earnings for a given quarter.

They will usually not have a normal probability distribution, but will behave much closer to a lognormal random variable.

(For other Data Scientists: chime in in the comments if you can think of any other empirical lognormal variables you’ve come across in your work! Especially anything outside of finances).

Exponential Probability Distribution

Source: Wikipedia.

Exponential probability distributions appear everywhere, too.

They are heavily linked to a Probability concept called a Poisson Process.

Stealing straight from Wikipedia, a Poisson Process is “a process in which events occur continuously and independently at a constant average rate”.

All that means is, if:

You have a lot of events going.

They happen at a certain rate (which does not change over time).

One event happening doesn’t change the chances of another one happening.

Then you have a Poisson process.

Some examples could be requests coming to a server, transactions happening in a supermarket, or birds fishing in a certain lake.

Imagine a Poisson Process with a frequency rate of λ (say, events happen once every second).

Exponential random variables model the time it takes, after an event, for the next event to occur.

Interestingly, in a Poisson Process an event can happen any number of times, from 0 upward (with the probability eventually shrinking as the count grows), in any interval of time.

This means that however long you wait, there’s a non-zero chance that the event still hasn’t happened.

It also means it could happen a lot of times in a very short interval.
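
To see both effects, here’s a minimal simulation sketch. It builds a Poisson process by adding up exponential gaps between events, then counts how many events land in each one-second window. The function name, the rate of one event per second and the 10,000-second horizon are all assumptions made for the example.

```python
import random
from collections import Counter

def poisson_process_event_times(lam, horizon, rng=random):
    """Event times on [0, horizon) for a Poisson process with rate lam."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)  # exponential gap until the next event
        if t >= horizon:
            return times
        times.append(t)

rng = random.Random(7)  # fixed seed, only for repeatability
lam, horizon = 1.0, 10_000.0  # one event per second on average, for 10,000 seconds
events = poisson_process_event_times(lam, horizon, rng)

# Count how many events landed in each one-second window.
per_window = Counter(int(t) for t in events)
counts = Counter(per_window.get(w, 0) for w in range(int(horizon)))
print(len(events) / horizon)   # overall rate, close to lam
print(sorted(counts.items()))  # number of windows with 0 events, 1 event, 2 events, ...
```

Even at one event per second on average, a good share of the windows stay empty while a few get three or more events.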

In class we used to joke bus arrivals are Poisson Processes.

I think the response time when you send a WhatsApp message to some people also fits the criteria.

However, the λ parameter regulates the frequency of the events.

It sets the expected time until the next event, which is 1/λ.

This means if we know a taxi passes our block every 15 minutes on average, even though theoretically we could wait for it forever, it’s unlikely we’ll wait much longer than, say, 30 minutes: under this model the chance of that is e^(-2), or about 13.5%.
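
If you want to put numbers on the taxi example, the probability of waiting longer than t for an exponential variable is e^(-λt). Here’s a tiny sketch, assuming the 15 minutes are the average wait (so λ = 1/15 per minute). The helper name prob_wait_longer_than is made up.

```python
import math

def prob_wait_longer_than(t, mean_wait):
    """P(T > t) for an exponential waiting time T with the given mean."""
    lam = 1.0 / mean_wait      # rate: events per unit of time
    return math.exp(-lam * t)  # survival function of the exponential

for minutes in (15, 30, 60):
    print(minutes, round(prob_wait_longer_than(minutes, mean_wait=15), 4))
```

Under these assumptions the chance of waiting more than 30 minutes comes out at about 13.5%, and the chance of waiting more than an hour is under 2%.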

Exponential Probability Distribution: In Practice

Here’s the density function for an exponential distribution random variable:

f(x) = λ e^(-λx) for x ≥ 0, and f(x) = 0 otherwise.

Suppose you have a sample from a variable and want to see if it can be modelled with an Exponential distribution Variable.

The optimum λ parameter can be easily estimated as the inverse of the average of your sampled values.
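
As a sketch, that estimate is a one-liner in Python. The data here is simulated with a made-up “true” rate of 0.25 just so we can check that the estimate lands near it. With real data you would pass in your own sample.

```python
import random
import statistics

def estimate_rate(samples):
    """Estimate lambda as the inverse of the sample mean."""
    return 1.0 / statistics.fmean(samples)

rng = random.Random(8)  # fixed seed, only for repeatability
true_lam = 0.25         # hypothetical rate used to generate test data
samples = [rng.expovariate(true_lam) for _ in range(50_000)]
print(estimate_rate(samples))  # should come out near 0.25
```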

Exponential variables are very good for modelling any probability distributions with very infrequent, but huge (and mean-breaking) outliers.

This is because they can take any non-negative value but center in smaller ones, with decreased frequency as the value grows.

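In a particularly outlier-heavy sample, you may want to estimate λ from the median instead of the average, since the median is more robust to outliers. For an exponential variable the median equals ln(2)/λ, so the estimate becomes ln(2) divided by the sample median.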

Your mileage may vary on this one, so take it with a grain of salt.
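
Here’s a hedged sketch of one way to use the median. It relies on the fact that an exponential variable has median ln(2)/λ, so λ can be estimated as ln(2) divided by the sample median. The helper names, the rate of 0.5 and the single absurd outlier are all invented for the demonstration.

```python
import math
import random
import statistics

def rate_from_mean(samples):
    """Estimate lambda as 1 / mean."""
    return 1.0 / statistics.fmean(samples)

def rate_from_median(samples):
    """Estimate lambda from the median, using median = ln(2) / lambda."""
    return math.log(2) / statistics.median(samples)

rng = random.Random(9)  # fixed seed, only for repeatability
samples = [rng.expovariate(0.5) for _ in range(10_000)]
contaminated = samples + [1_000_000.0]  # one made-up, absurdly large outlier

# On clean data both estimates should be near 0.5.
print(rate_from_mean(samples), rate_from_median(samples))
# With the outlier, the mean-based estimate collapses; the median-based one barely moves.
print(rate_from_mean(contaminated), rate_from_median(contaminated))
```

The trade-off is that on clean data the median-based estimate is noisier than the mean-based one, so it only pays off when the outliers really are a problem.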

Conclusions

To sum up, as Data Scientists, I think it’s important for us to learn the basics.

Probability and Statistics may not be as flashy as Deep Learning or Unsupervised Machine Learning, but they are the bedrock of Data Science, and especially of Machine Learning.

Feeding a Machine Learning model with features without knowing which distribution they follow is, in my experience, a poor choice.

It’s also good to remember the ubiquity of Exponential and Normal Probability Distributions, and their smaller counterpart, the lognormal distribution.

Knowing their properties, uses and appearance is game-changing when training a Machine Learning model.

It’s also generally good to keep them in mind while doing any kind of Data Analysis.

Did you find any part of this article useful? Was it all stuff you already knew? Did you learn anything new? Let me know in the comments!

Contact me on Twitter, Medium or dev.to if there’s anything you don’t think was clear enough, anything that you disagree with, or just anything that’s plain wrong.

Don’t worry, I don’t bite.

Originally published at http://www.datastuff.tech on June 17, 2019.
