Why do machine learning specialists and data scientists need Bayesian statistics?

Bayesian Statistics vs Frequentist Statistics

For those of you who are unfamiliar with the terms Bayesian and frequentist, let me elaborate.
A frequentist approach looks at data from the point of view of frequency.
For example, let’s say I have a biased coin with heads on both sides.
I flip the coin 10 times, and 10 times I get heads.
If I take the average result of all the coin flips, I get 1, which suggests that my next flip has a 100% chance of being heads and a 0% chance of being tails. This is a frequentist way of thinking.
Now take the Bayesian point of view.
I start out with a prior probability, which I will choose to be 0.5 because I am assuming the coin is fair.
However, what is different is how I choose to update my probability.
After each coin flip, I look at how likely the observed result is given my current belief (initially, that I have a fair coin) and update that belief accordingly.
Progressively, as I flip more heads, my estimated probability of heads will tend towards 1, but it will never be exactly 1.
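To make the coin example concrete, here is a minimal sketch in Python. It assumes a Beta(1, 1) prior (uniform over the coin's bias, with mean 0.5), which is one common way to encode "I think the coin is fair but I am not sure". The function names are illustrative and do not come from any library.

```python
from fractions import Fraction

def frequentist_estimate(flips):
    """Maximum likelihood estimate of P(heads): the observed fraction of heads."""
    return Fraction(sum(flips), len(flips))

def bayesian_posterior_mean(flips, prior_heads=1, prior_tails=1):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    The Beta prior is conjugate to coin flips, so updating only means
    adding the observed counts to the prior pseudo-counts.
    """
    heads = sum(flips)
    tails = len(flips) - heads
    a = prior_heads + heads
    b = prior_tails + tails
    return Fraction(a, a + b)

flips = [1] * 10  # ten heads in a row (1 = heads, 0 = tails)

print(frequentist_estimate(flips))     # 1
print(bayesian_posterior_mean(flips))  # 11/12

# The Bayesian estimate creeps towards 1 as heads accumulate, without reaching it.
for n in (0, 1, 10, 100):
    print(n, float(bayesian_posterior_mean([1] * n)))
```

Under this assumed prior, the posterior mean after n straight heads is (n + 1) / (n + 2), which is always strictly below 1.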
The fundamental difference between the Bayesian and frequentist approach is about where the randomness is present.
In the frequentist domain, the data is considered random and the parameters (e.g. mean, variance) are fixed.
In the Bayesian domain, the parameters are considered random and the data is fixed.
I really want to stress one point right now.
It is not called Bayesian because you are using Bayes' theorem (which is also commonly used from a frequentist perspective).
It is called Bayesian because the terms in the equations have a different underlying meaning.
This theoretical difference leads to a very meaningful practical one. In the frequentist setting, your estimator returns a single value for each parameter (the data is random, the parameters are fixed). In the Bayesian setting, you have a distribution over the parameters (the parameters are random, the data is fixed), so you need to integrate over that distribution to obtain the distribution over your data.
This is one reason the mathematics behind Bayesian statistics gets a bit messier than in frequentist statistics, and why one often resorts to Markov chain Monte Carlo methods, which sample from distributions in order to estimate the value of intractable integrals.
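In symbols, the integration step can be sketched as follows. The notation is an assumption made here for illustration: θ stands for the parameters, D for the observed data, and x_new for a future observation.

```latex
\begin{align}
p(\theta \mid D) &= \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'} \\
p(x_{\text{new}} \mid D) &= \int p(x_{\text{new}} \mid \theta)\, p(\theta \mid D)\, d\theta
\end{align}
```

The first line is the posterior over the parameters, and the second averages the model's predictions over that posterior. Both contain integrals that rarely have a closed form outside simple conjugate models.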
Other nifty techniques, such as the Law of the Unconscious Statistician (what a great name, right?), also known as LOTUS, can help with the mathematics.
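As a rough illustration of both ideas, the sketch below uses a random-walk Metropolis sampler (one simple Markov chain Monte Carlo method) to draw from the posterior of a coin's bias after 10 heads and 0 tails, assuming a uniform prior. It then applies LOTUS: the expectation of g(θ) is estimated by averaging g over the samples, without ever deriving the distribution of g(θ) itself. The step size, sample count, and function names are arbitrary choices for this example.

```python
import math
import random

def log_posterior(theta, heads, tails):
    """Unnormalised log posterior of the coin bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior mass outside (0, 1)
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(heads, tails, n_samples=50_000, step=0.1, burn_in=1_000, seed=0):
    """Random-walk Metropolis sampler for the coin bias."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = theta + rng.gauss(0.0, step)  # symmetric proposal
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            samples.append(theta)
    return samples

samples = metropolis(heads=10, tails=0)

# Posterior mean of the bias (the exact answer here is 11/12).
posterior_mean = sum(samples) / len(samples)

# LOTUS: E[g(theta)] is the average of g over the samples.
# Here g(theta) = theta**2, the chance that the next two flips are both heads
# (the exact answer here is 11/13).
two_heads = sum(t * t for t in samples) / len(samples)

print(round(posterior_mean, 3), round(two_heads, 3))
```

This toy posterior has a closed form, so sampling is not actually needed. The point is that the same recipe still works when the integral cannot be solved by hand.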
So which methodology is better?

These methods are essentially two sides of the same coin (pun intended). They often give you similar results, particularly when data is plentiful, but the way they get there is different.
Neither is better than the other.
In fact, I even have professors in my classes at Harvard who frequently argue over which is better.
The general consensus is that ‘it depends on the problem’, if one can consider that a consensus.
Personally, I find the Bayesian approach more intuitive, but the underlying mathematics is far more involved than in the traditional frequentist approach.
Now that you (hopefully) understand the difference, perhaps the below joke will make you chuckle.
[Image: Bayesian vs frequentist joke.]
When should I use Bayesian statistics?

Bayesian statistics encompasses a specific class of models that can be used for machine learning.
Typically, one draws on Bayesian models for one or more of a variety of reasons, such as:

- Having relatively few data points
- Having strong prior intuitions (from pre-existing observations/models) about how things work
- Having high levels of uncertainty, or a strong need to quantify the level of uncertainty about a particular model or comparison of models
- Wanting to claim something about the likelihood of the alternative hypothesis, rather than simply accepting/rejecting the null hypothesis

Looking at this list, you might think that people would want to use Bayesian methods in machine learning all of the time.
However, that’s not the case, and I suspect the relative dearth of Bayesian approaches to machine learning is due to:

- Most machine learning is done in the context of “big data”, where the signature of Bayesian models (priors) doesn’t actually play much of a role.
- Sampling posterior distributions in Bayesian models is computationally expensive and slow.
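The first point can be illustrated with a small sketch. Two hypothetical analysts start from opposite Beta priors about a coin, one expecting mostly tails and one expecting mostly heads, and both see the same data. The 0.7 heads rate and the prior pseudo-counts are made-up values for this example.

```python
def posterior_mean(heads, flips, prior_heads, prior_tails):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior."""
    return (prior_heads + heads) / (prior_heads + prior_tails + flips)

def prior_gap(flips, true_rate=0.7):
    """Difference between the two analysts' posterior means after `flips` flips."""
    heads = round(true_rate * flips)               # idealised data, no sampling noise
    sceptic = posterior_mean(heads, flips, 2, 8)   # prior mean 0.2
    believer = posterior_mean(heads, flips, 8, 2)  # prior mean 0.8
    return believer - sceptic

for flips in (10, 100, 10_000, 1_000_000):
    print(flips, prior_gap(flips))
```

With these priors the gap works out to 6 / (10 + flips): large after ten flips, negligible after a million. That is the sense in which priors stop mattering once the data dominates.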
That said, there is a lot of synergy between the frequentist and Bayesian approaches, especially in today’s world where big data and predictive analytics have become so prominent.
We have loads and loads of data for a variety of systems, and we can constantly make data-driven inferences about the system and keep updating them as more and more data becomes available.
Since Bayesian statistics provides a framework for updating “knowledge”, it is, in fact, used a whole lot in machine learning.
Several machine learning techniques, such as Gaussian processes and simple linear regression, have Bayesian and non-Bayesian versions.
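There are also algorithms that are usually treated as frequentist (e.g. support vector machines, random forests), and those that are Bayesian at heart (e.g. variational inference). Expectation maximization is often grouped with the latter, although it is most commonly used to find maximum likelihood estimates.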
Learning when to use each of these and why is what makes you a real data scientist.
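To show what "Bayesian and non-Bayesian versions" of the same technique can look like, here is a minimal sketch for the simplest possible linear regression: a line through the origin, y = w * x plus noise. The Bayesian version assumes a zero-mean Gaussian prior on w and Gaussian noise with known variance. The data points and variances are made up for illustration.

```python
def ols_slope(xs, ys):
    """Least-squares slope for y = w * x (no intercept): a single point estimate."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def bayesian_slope(xs, ys, noise_var=1.0, prior_var=1.0):
    """Posterior mean and variance of w, assuming w ~ N(0, prior_var)
    and Gaussian noise with known variance noise_var."""
    precision = sum(x * x for x in xs) / noise_var + 1.0 / prior_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]  # hypothetical measurements

w_ols = ols_slope(xs, ys)
w_mean, w_var = bayesian_slope(xs, ys)

print(w_ols)          # one number
print(w_mean, w_var)  # a distribution over w, summarised by mean and variance
```

The frequentist fit returns a single slope, while the Bayesian fit returns a distribution whose mean is pulled slightly towards the prior. As prior_var grows, the posterior mean approaches the least-squares slope.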
Are you a Bayesian or a Frequentist at heart?

Personally, I am not in one camp or the other. Sometimes I am using statistics/machine learning on a dataset with thousands of features about which I know nothing.
Thus, I have no prior belief and Bayesian inference seems inappropriate.
However, sometimes I have a small number of features that I know quite a lot about, and I would like to incorporate that knowledge into my model, in which case Bayesian methods will give me more conclusive intervals/results that I trust.
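As a small sketch of that last point, the code below estimates a 95% credible interval for a success rate after only 8 observations, once with a flat Beta(1, 1) prior and once with an informative Beta(20, 20) prior. The data and the priors are hypothetical. The interval is computed by sampling with `random.betavariate` from the standard library rather than with an exact quantile function.

```python
import random

def credible_interval(successes, failures, prior_a=1.0, prior_b=1.0,
                      level=0.95, n_draws=100_000, seed=0):
    """Equal-tailed credible interval for a rate, estimated from posterior draws."""
    rng = random.Random(seed)
    draws = sorted(
        rng.betavariate(prior_a + successes, prior_b + failures)
        for _ in range(n_draws)
    )
    lower = draws[int((1 - level) / 2 * n_draws)]
    upper = draws[int((1 + level) / 2 * n_draws) - 1]
    return lower, upper

# 3 successes in 8 trials: very little data.
flat = credible_interval(3, 5)                               # no prior knowledge
informed = credible_interval(3, 5, prior_a=20, prior_b=20)   # prior belief: rate near 0.5

print(flat)
print(informed)
```

With so little data, the informed prior gives a noticeably narrower interval. Whether that extra certainty is justified depends entirely on how much you trust the prior.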
Where should I go to learn more about Bayesian statistics?

There are several great online classes that delve deep into Bayesian statistics for machine learning.
The best resource I would recommend is the class I took here at Harvard, AM207: Advanced Scientific Computing (Stochastic Optimization Methods, Monte Carlo Methods for Inference and Data Analysis).
You can find all the lecture resources, notes, and even Jupyter notebooks running through the techniques here.
Here is also a great video which talks about converting between Bayesian and frequentist domains (go to around 11 minutes in the video).
If you want to become a really great data scientist, I would suggest you get a firm grip on Bayesian statistics and how it can be used to solve problems.
The journey is difficult and it is a steep learning curve, but it is a great way to separate yourself from other data scientists.
From discussions I have had with colleagues going for data science interviews, Bayesian modeling is something that comes up pretty often, so keep that in mind!