MCMC Intuition for Everyone

Can you think of a way to sample from an arbitrary probability distribution? Think about it for a moment…

MCMC provides us with ways to sample from any probability distribution.

Why would you want to sample from a distribution? Sometimes computing the maximum a posteriori (MAP) estimate at inference time is infeasible.

In such cases, we use the posterior mean, rather than the MAP parameters, as the estimate when making a prediction.

We can estimate the posterior mean by drawing samples from the posterior distribution and averaging them.
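For instance, once we have draws from the posterior, the posterior mean is simply the average of those draws. Here is a minimal sketch (the sample values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical draws from a posterior distribution
# (in practice these would come from an MCMC sampler)
posterior_samples = np.array([0.42, 0.51, 0.47, 0.55, 0.44, 0.50])

# The posterior mean is estimated by simply averaging the samples
posterior_mean_estimate = posterior_samples.mean()
print(posterior_mean_estimate)
```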

A time series package called Prophet uses MCMC sampling at inference time to generate estimates.

Don’t worry if you don’t understand this yet.

According to Wikipedia: Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its stationary distribution.

The state of the chain after a number of steps is then used as a sample of the desired distribution.

The quality of the sample improves as a function of the number of steps.

So let’s explain this with an example. Assume that we want to sample from a Beta distribution.

The PDF of the Beta(α, β) distribution is f(x) = C · x^(α−1) · (1−x)^(β−1) for x in [0, 1], where C is the normalizing constant (which, as we will see later, we don’t need in order to sample from the distribution).
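To make the role of C concrete, here is a small helper (a sketch; the function name and the parameters a, b are just illustrative choices) that evaluates the Beta density only up to the normalizing constant, which is all we will need later:

```python
def beta_unnormalized(x, a, b):
    """Beta(a, b) density at x, up to the normalizing constant C."""
    if x <= 0 or x >= 1:
        return 0.0
    return x ** (a - 1) * (1 - x) ** (b - 1)
```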

Sampling from the Beta distribution directly is a somewhat tricky problem, if not an intractable one.

In practice, you might need to work with much harder distribution functions, and sometimes you won’t know the normalizing constant.

MCMC methods make life easier by providing algorithms that create a Markov chain which has the Beta distribution as its stationary distribution, given only that we can sample from a uniform distribution (which is relatively easy).

If we start from a random state and repeatedly move to the next state according to such an algorithm, we end up with a Markov chain that has the Beta distribution as its stationary distribution, and the states we visit after a long time can be used as samples from the Beta distribution.

One such MCMC algorithm is the Metropolis-Hastings algorithm.

Metropolis-Hastings Algorithm

Intuition: First, what’s the goal? Intuitively, we want to walk around on some (lumpy) surface (our Markov chain) in such a way that the amount of time we spend in each location is proportional to the height of the surface at that location (the desired PDF from which we need to sample).

So, for example, we’d like to spend twice as much time on a hilltop that’s at an altitude of 100 m as we do on a nearby hill that’s at an altitude of 50 m.

The nice thing is that we can do this even if we don’t know the absolute heights of points on the surface: all we have to know are the relative heights. For example, if hilltop A is twice as high as hilltop B, then we’d like to spend twice as much time at A as we spend at B.

There are more complicated schemes for proposing new locations and rules for accepting them, but the basic idea is still:

1. Pick a new “proposed” location.
2. Figure out how much higher or lower that location is compared to your current location.
3. Probabilistically stay put or move to that location, in a way that respects the overall goal of spending time proportional to the height of the location.

The goal of MCMC is to draw samples from some probability distribution without having to know its exact height at any point (we don’t need to know C).

If the “wandering around” process is set up correctly, you can make sure that this proportionality (between time spent and the height of the distribution) is achieved.

Algorithm: Let us define the problem more formally now.

Let s = (s1, s2, …, sM) be the desired stationary distribution.

We want to create a Markov Chain that has this stationary distribution.

We start with an arbitrary Markov chain on M states with transition matrix P, so that pij represents the probability of going from state i to state j.

Intuitively, we know how to wander around this Markov chain, but it does not have the required stationary distribution.

This chain does have some stationary distribution (which is of no use to us). Our goal is to change the way we wander on this Markov chain so that the chain has the desired stationary distribution.

To do this, we:

1. Start at a random initial state i.
2. Randomly pick a new proposal state j by looking at the transition probabilities in the i-th row of the transition matrix P.
3. Compute the acceptance probability aij = min(sj · pji / (si · pij), 1).
4. Flip a coin that lands heads with probability aij. If the coin comes up heads, accept the proposal, i.e. move to the proposed state; otherwise reject the proposal, i.e. stay at the current state.
5. Repeat for a long time.

After a long time, this chain will converge and will have the stationary distribution s.

We can then use the states of the chain as samples from the desired distribution.
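To make this recipe concrete, here is a minimal sketch for a finite state space, assuming we are given the target distribution s (possibly unnormalized) and a proposal transition matrix P; the function name and arguments are illustrative, not a canonical implementation:

```python
import numpy as np

def metropolis_hastings_discrete(s, P, n_steps=100_000, seed=0):
    """Return states whose long-run frequencies follow s, using proposal matrix P.

    s : length-M array of (possibly unnormalized) target probabilities
    P : M x M proposal matrix, P[i, j] = probability of proposing j from state i
    """
    rng = np.random.default_rng(seed)
    M = len(s)
    i = rng.integers(M)                    # random initial state
    samples = []
    for _ in range(n_steps):
        j = rng.choice(M, p=P[i])          # propose j from the i-th row of P
        a_ij = min(s[j] * P[j, i] / (s[i] * P[i, j]), 1.0)  # acceptance probability
        if rng.random() < a_ij:            # coin flip that lands heads with prob a_ij
            i = j                          # heads: accept the proposal and move
        # tails: reject the proposal and stay at the current state
        samples.append(i)                  # record the state visited at this step
    return np.array(samples)
```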

While doing this to sample from the Beta distribution, the only place we use the PDF is in the acceptance probability, and there we divide sj by si, i.e., the normalizing constant C cancels out.

Sampling from the Beta Distribution

Now let us move on to the problem of sampling from the Beta distribution.

The Beta distribution is a continuous distribution on [0, 1], so the chain can have infinitely many states on [0, 1].

Let us assume an arbitrary Markov chain with infinitely many states on [0, 1] and a transition matrix P in which all entries are equal, so that pij = pji.

We don’t actually need the matrix P, as we will see later, but I want to keep the problem description as close as possible to the algorithm we described above.

1. Start at a random initial state i given by Unif(0, 1).
2. Randomly pick a new proposal state by looking at the transition probabilities in the i-th row of the transition matrix P; here, that simply means drawing another Unif(0, 1) value as the proposal state j.
3. Compute the acceptance probability aij = min(sj · pji / (si · pij), 1), which simplifies to aij = min(sj / si, 1) since pji = pij, where si = C · xi^(α−1) · (1−xi)^(β−1) and sj = C · xj^(α−1) · (1−xj)^(β−1), so the constant C cancels.
4. Flip a coin that lands heads with probability aij. If the coin comes up heads, accept the proposal, i.e. move to the proposed state; otherwise reject the proposal, i.e. stay at the current state.
5. Repeat for a long time.

Code: So enough with the theory, let us move on to Python to create our Beta sampler.
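Here is a minimal sketch of such a sampler, using Unif(0, 1) proposals; the parameter values α = β = 2 and the function names are illustrative choices, not part of any particular library:

```python
import numpy as np

def beta_unnormalized(x, a, b):
    """Beta(a, b) density at x, up to the normalizing constant C."""
    if x <= 0 or x >= 1:
        return 0.0
    return x ** (a - 1) * (1 - x) ** (b - 1)

def beta_mcmc_sampler(n_samples, a=2.0, b=2.0, seed=0):
    """Metropolis-Hastings sampler for Beta(a, b) with Unif(0, 1) proposals."""
    rng = np.random.default_rng(seed)
    state = rng.random()                   # random initial state from Unif(0, 1)
    samples = []
    for _ in range(n_samples):
        proposal = rng.random()            # proposal state, also from Unif(0, 1)
        s_i = beta_unnormalized(state, a, b)
        s_j = beta_unnormalized(proposal, a, b)
        # acceptance probability min(s_j / s_i, 1); C and p_ij cancel out
        a_ij = 1.0 if s_i == 0 else min(s_j / s_i, 1.0)
        if rng.random() < a_ij:            # coin flip that lands heads with prob a_ij
            state = proposal               # heads: accept and move to the proposal
        # tails: reject and stay at the current state
        samples.append(state)              # record the state visited at this step
    return np.array(samples)

samples = beta_mcmc_sampler(100_000, a=2.0, b=2.0)
```

In practice you would also discard an initial “burn-in” portion of the chain before treating the visited states as samples.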

Let us check the results of our MCMC-sampled Beta distribution against the actual Beta distribution.
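One way to do this check, assuming the samples array produced by the sketch above and using scipy only for the exact density, is to overlay a histogram of the MCMC samples on the true Beta PDF:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(0, 1, 200)
plt.hist(samples, bins=50, density=True, alpha=0.5, label="MCMC samples")
plt.plot(x, beta.pdf(x, 2, 2), label="True Beta(2, 2) PDF")
plt.legend()
plt.show()
```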

As we can see, our sampled values closely resemble the Beta distribution, and thus our MCMC chain has reached its stationary distribution.

We created a Beta sampler in the code above, but the same concept applies universally to any other distribution we want to sample from.

Conclusion

That was a big post.

Congrats if you reached the end.

In essence, MCMC methods might be complicated, but they provide us with a lot of flexibility.

You can sample from any distribution function using MCMC sampling.

They are commonly used to sample from posterior distributions at inference time.

You can also use MCMC to solve problems with a large state space, for example, the knapsack problem or decryption.

I will work on providing you with more fun examples in my next blog post.

Stay tuned.

One of the newest and best resources that you can keep an eye on is the Bayesian Methods for Machine Learning course in the Advanced Machine Learning specialization.

I am going to write more posts like this in the future too.

Let me know what you think about the series.

Follow me on Medium or subscribe to my blog to be informed about them.

As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.

References

- Introduction to Probability, Joseph K. Blitzstein and Jessica Hwang
- Wikipedia
- StackExchange
