Behind The Models: Dirichlet - How Does It Add To 1?

Building Blocks For Non-Parametric Bayesian Models

Tony Pistilli
Jun 18

In a previous article I presented the Dirichlet distribution as a combination of many Beta-distributed variables which add to 1.0. This can be useful for applications where you need a “random” classifier, which is the subject of an article still in the works.

Sebastian Bruijns asked the obvious question I skirted around in the original article:

“Very nice and understandable article. I just don’t understand the connection between Dirichlet and Beta, you wrote that the variates of the Dirichlet follow a Beta dist., but how does that work, how is it guaranteed to add up to one?”

Here’s the nitty-gritty:

Background

This section is mostly a review of the previous article; skip over it if you’re familiar with those details.

The Dirichlet distribution is a multivariate generalization of the Beta.

The input is a vector of 2 or more parameters, and the output is a distribution in which the sum of the variables always equals 1.0 and each individual variable is Beta-distributed.

The “Stick Breaking” in the image below will unite these two ideas.

The intuitive way of generating the Dirichlet after reading the above is to generate random Betas and sum them up, but that clearly won’t work.

In the simulation below we show that the sum of 5 Beta(3, 7)’s is something normal-looking (it’s not exactly normal, in part because the Beta is bounded on [0, 1], so the sum is bounded on [0, 5]). Each Beta(3, 7) has mean 0.3, so the sum centers around 1.5 rather than 1.

    import numpy as np
    from matplotlib import pyplot as plt

    n = 100000

    # Five independent sets of Beta(3, 7) draws
    z = [np.random.beta(3, 7, n) for i in range(5)]
    for i in range(5):
        plt.hist(z[i], histtype='step', bins=300)
    plt.show()

    # Element-wise sum of the five Betas
    z_sum = [sum([z[x][i] for x in range(5)]) for i in range(n)]
    plt.hist(z_sum, histtype='step', bins=300)
    plt.show()

Figure. Left: a histogram of 5 Beta(3, 7) variables; Right: a histogram of the sum of the 5 Betas.

So How Does The Dirichlet Add To 1

The PDF for the Dirichlet distribution is below.

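Here θ is the vector of multinomial category probabilities, and α is the vector of parameters.

Each entry of α acts like the first parameter of a Beta; there is no second parameter to supply.

Note that the resulting Beta-distributed variates will not follow a Beta(α, 1): the β parameter is a function of the α vector.

When every entry of the α vector has the same value α, the resulting Beta-distributed variates follow Beta(α, (K-1)*α), where K is the number of values in the α vector. More generally, the i-th variate follows Beta(α(i), α(0) - α(i)), where α(0) is the sum of the α vector.

That may start to give some intuition about why the Dirichlet adds to 1: each variate has mean α / (K*α) = 1/K, so the K means add up to exactly 1, and as the α vector gets longer each variate is pushed closer to 0.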
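The marginal claim can be checked numerically without fitting anything. The sketch below compares sample moments of Dirichlet draws against the moments of the corresponding Beta. The α vector (2, 3, 5), the seed, and the helper name beta_moments are illustrative choices, not values from the original article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility


def beta_moments(a, b):
    """Theoretical mean and variance of a Beta(a, b) variable."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


alpha = np.array([2.0, 3.0, 5.0])  # illustrative, deliberately asymmetric
alpha_0 = alpha.sum()
draws = rng.dirichlet(alpha, size=200_000)  # shape (200000, 3)

# Compare each column with Beta(alpha_i, alpha_0 - alpha_i)
for i, a_i in enumerate(alpha):
    mean, var = beta_moments(a_i, alpha_0 - a_i)
    print(f"variate {i}: sample mean {draws[:, i].mean():.4f} vs {mean:.4f}, "
          f"sample var {draws[:, i].var():.5f} vs {var:.5f}")

# Every draw sums to 1, up to floating-point error
print(np.allclose(draws.sum(axis=1), 1.0))
```

Note that the second variate here is a Beta(3, 7), the same distribution used in the simulation above. The difference is that these Betas are dependent on each other, which is what lets them add to 1.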

Have a go at it yourself:

    import numpy as np
    from scipy.stats import beta

    x = np.random.dirichlet((1, 1, 1), 100000)

    # Fit a Beta to each variate, holding location at 0 and scale at 1
    for n in range(3):
        a, b, l, s = beta.fit(x[:, n], floc=0, fscale=1)
        print(a)
        print(b)

The form of the PDF for the Dirichlet gives a first glimpse at how it ensures that the Betas add up to 1.0.

Here θ is the multinomial category, and α is the vector we’ve been working with.
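For reference, the standard textbook form of the Dirichlet density is sketched below in LaTeX. This is the usual form rather than a copy of the original figure, and K (the length of the α vector) is the only symbol added here.

```latex
f(\theta_1, \dots, \theta_K \mid \alpha_1, \dots, \alpha_K)
  = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
    \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},
\qquad \theta_i \ge 0, \quad \sum_{i=1}^{K} \theta_i = 1
```

The constraint on the right is the part that matters for this article: the density is only defined on vectors θ whose entries already add to 1.0.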

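A practical implementation uses the Gamma distribution and is easy to intuit. Recall the relationship between the Beta and Gamma functions, B(α,β) = Γ(α)Γ(β) / Γ(α+β); the matching fact for distributions is that if X is Gamma(α, 1) and Y is an independent Gamma(β, 1), then X / (X + Y) is Beta(α, β).

First draw an independent Gamma(α, 1) for each Dirichlet variate, then divide each draw by the sum of all the draws so that they add to 1.0. The results are Dirichlet-distributed variables.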

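    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.stats import beta

    a = 3
    n_ = 5

    # One independent Gamma(a, 1) sample per Dirichlet variate
    y = [np.random.gamma(a, 1, 100000) for i in range(n_)]
    y_sum = [sum([y[x][i] for x in range(n_)]) for i in range(100000)]
    # Divide each Gamma draw by the row total so every row sums to 1
    x = [[y[x][i] / y_sum[i] for x in range(n_)] for i in range(100000)]
    # Transpose so that x[i] holds all draws of the i-th variate
    x = [[x[i][n] for i in range(100000)] for n in range(n_)]
    for i in range(n_):
        # density=True replaces the normed=True argument removed from newer matplotlib
        plt.hist(x[i], histtype='step', bins=300, density=True)
    a, b, l, s = beta.fit(x[0], floc=0, fscale=1)

As an aside, this also affords a great opportunity to use some robust list-comprehension: I’ve been salivating over this article for about a month.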
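The same Gamma normalization can also be written with NumPy arrays instead of nested list comprehensions. This is a minimal sketch under the same settings as above (shape 3, five variates); the names k, n and rng are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

a = 3.0      # shared Gamma shape, matching a = 3 above
k = 5        # number of Dirichlet variates, matching n_ = 5 above
n = 100_000  # number of simulated Dirichlet draws

# Each row holds k independent Gamma(a, 1) draws
y = rng.gamma(shape=a, scale=1.0, size=(n, k))

# Dividing each row by its own total forces the row to sum to 1
x = y / y.sum(axis=1, keepdims=True)

print(np.allclose(x.sum(axis=1), 1.0))
print(x.mean(axis=0))  # each entry should be near 1 / k
```

Each column of x should behave like a Beta(3, 12), i.e. Beta(α, (K-1)*α) with α = 3 and K = 5, which has mean 0.2 and variance 0.01.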

The Infinite Case: The Dirichlet Process

In the previous article we arrived at a Dirichlet process, which is slightly different from a Dirichlet distribution.

A Dirichlet distribution can be expanded to have an infinite number of variates via a Dirichlet process.

This is another fact that I waved my hands about that might have offended your intuitions.

The way to think about a Dirichlet process is that each pull from it is itself Dirichlet-distributed (not Beta distributed, though of course draws from a Dirichlet distribution are Beta distributed).

So unlike the Dirichlet distribution above where each variate ...

The process is possible because we can iteratively generate a new variate, so we don’t actually generate infinite variates, but we create a framework where we could always create more variates if needed.

In this case we can’t use the Gamma trick from above because we can’t divide by the total sum, but we can be assured that if we went on to infinity we’d never end up with variates adding up to more than 1.

The formula is reproduced below: p(i) is the Dirichlet variate and V(i) is an intermediary variable.
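As a sketch, the stick-breaking recursion can be written in LaTeX as follows. It is reconstructed from the simulation code in this section rather than copied from the original figure, and it only assumes that each V(i) lies in [0, 1].

```latex
p_1 = V_1, \qquad
p_i = V_i \prod_{j=1}^{i-1} (1 - V_j) \quad \text{for } i \ge 2

\sum_{i=1}^{k} p_i = 1 - \prod_{i=1}^{k} (1 - V_i) \le 1
```

The second line follows by induction: after k breaks, the unused part of the stick is the product of the (1 - V(i)) terms, and each new variate takes a fraction V(k+1) of what is left. That is why the running total can approach 1 but never pass it.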

Below we generate the first 3 variates in a Dirichlet process:

    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.stats import beta

    k = 3

    # Intermediary variables V(i): independent Beta(1, 1) draws
    z = [np.random.beta(1, 1, 100000) for i in range(k)]
    # p(i) = V(i) times the product of (1 - V(j)) over all earlier j
    p = [[np.prod([1 - z[x][n] for x in range(i)]) * z[i][n] for i in range(k)] for n in range(100000)]
    # Transpose so that p[i] holds all draws of the i-th variate
    p = [[p[i][x] for i in range(100000)] for x in range(k)]
    for i in range(k):
        plt.hist(p[i], histtype='step', bins=300, density=True)
        a, b, l, s = beta.fit(p[i], floc=0, fscale=1)
        print(a)
        print(b)
    plt.show()

    # Sum of the first k variates for each simulated draw
    p_sum = [sum([p[x][i] for x in range(k)]) for i in range(100000)]
    plt.hist(p_sum, histtype='step', bins=300, density=True)
    plt.show()

Figure. Left: graph of Dirichlet-distributed variates that are components of Dirichlet process; Right: sum of Dirichlet-distributed variates.

The first Beta draw is in blue in the left graph: Beta(1,1).

In the second draw in orange, we use Beta(1,1), but then multiply it by (1-V(1)), so the final variable is Beta-looking.

The green line is the third draw: the fitted β parameter continues to increase as we shift more mass toward the left-hand side.

This makes intuitive sense — using α = 1 we’re allowing relatively high values to come out early in the draws, so the remaining draws get forced into lower numbers.

The right graph tells us that with only three draws we’re already hugging pretty close to 1.0, so remaining draws are going to have very small margins to fit into.
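How close the running total gets to 1.0 can be computed directly. The sketch below is a vectorized version of the stick-breaking simulation; the function name stick_breaking and its parameters a, b and seed are hypothetical, introduced only for illustration.

```python
import numpy as np


def stick_breaking(k, n, a=1.0, b=1.0, seed=0):
    """Return (p, v): the first k stick-breaking weights for n independent
    runs, and the Beta(a, b) intermediary draws V(i) that produced them."""
    rng = np.random.default_rng(seed)
    v = rng.beta(a, b, size=(n, k))
    # remaining[:, i] is the product of (1 - V(j)) over j < i,
    # i.e. the length of stick still unbroken before break i
    remaining = np.cumprod(1.0 - v, axis=1)
    remaining = np.hstack([np.ones((n, 1)), remaining[:, :-1]])
    return v * remaining, v


p, v = stick_breaking(k=3, n=100_000)
p_sum = p.sum(axis=1)

print(p_sum.max())  # largest running total across all runs
# Running total equals 1 minus the unbroken remainder of the stick
print(np.allclose(p_sum, 1.0 - np.prod(1.0 - v, axis=1)))
print(p_sum.mean())  # theory for Beta(1, 1), k = 3: 1 - 0.5**3 = 0.875
```

With Beta(1, 1) draws each break takes half of the remaining stick on average, so three draws are expected to use up 87.5% of it. Changing a and b changes how quickly the stick is consumed.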

If we use α = 0.5 above we force the early draws to stick to lower values, and so have more freedom for later draws.

Figure. Left: graph of Dirichlet-distributed variates that are components of Dirichlet process; Right: sum of Dirichlet-distributed variates.

Conclusion

Like the past article, this article was theory heavy.

Hopefully it helped fill in some gaps that I glossed over in the previous article.

In future articles I will use these models in application-heavy examples.