Thompson Sampling For Multi-Armed Bandit Problems (Part 1)

Thompson Sampling For Multi-Armed Bandit Problems (Part 1)Using Bayesian Updating For Online Decision MakingTony PistilliBlockedUnblockFollowFollowingMay 31“Multi-armed bandit” is perhaps the coolest term in data science, excluding financial applications to “naked European call options” or “short iron butterflies”.

They are also among the most commonly encountered practical applications.

The term has a helpful motivating story: a “one armed bandit” refers to a slot machine — pull the “arm” and the “bandit” will take your money (most times — with some probability you’ll win big).

A “multi-armed bandit” is a slot machine with many arms, each with potentially a different probability of winning and associated payout.

Without knowing those probabilities/payouts, you’ll need to learn as you go, pulling levers that are successful more often and hopefully converging on the optimal lever sooner rather than later.

Gambling aside, we can imagine multi-armed bandit problems in many situations.

A hospital has several ideas about potential modifications to clinical protocol to improve results — with a limited number of clinical events to intervene on, and potential unintended consequences from poor modifications, how can we evaluate all modifications but ensure the best rise to the top quickly.

Or a marketing firm wants to deploy a new ad strategy by selecting one of 3 potential campaigns it has developed.

In this article we will focus on a Bernoulli bandit problem — one in which the payout is either 1 or 0 (reducing the complexity of the problem to identifying the probability of payout).

We will apply two methods: ε-greedy and Thompson Sampling.

In future articles we will discuss more complex multi-armed bandit problems (including sampling methods where conjugate priors are not as friendly to us!) and other methods for addressing this problem set.

ε-greedyAn algorithm is “greedy” in as much as it chooses an immediate payoff without regard to a long-term outcome.

Tree algorithms are often “greedy” — at each node they chose the split that minimizes entropy the most.

This split may not be the tree which minimizes entropy the most in subsequent splits.

Greedy algorithms are common in situations where the number of paths grows exponentially and soon becomes intractable (i.


computationally intense to the point of impossibility).

For example a tree based on 10 features, even allowing each variable to be used only one time (which is not a common restraint), 3.

6M possible configurations (10! = 10 * 9 * 8…).

Adding the 11th feature makes 39.

9M, etc.

Multi-armed bandit problems can reasonably be approached using a greedy algorithm.

In the slot-machine formulation, we assume all levers have the same probability of success, we randomly choose one, and then update our assumption of success based on success/failure.

At each successive iteration we choose the lever that has the highest probability of success.

This algorithm potentially ignores new information for the sake of immediate gain.

Suppose that we have 3 levers with probabilities 0.

04, 0.

07, and 0.


We may by chance find early success with a lower-probability lever, and then continue pulling that lever 20–50 times until it’s probability drops below the others.

The algorithm gets “stuck”.

An enhancement to this algorithm allows for a small probability of times (ε) where we ignore the probability of success and choose a lever at random.

This helps gather new information by allowing non-optimal choices enter the mix occasionally.

The code below sets up the algorithm for us.

Note that the Beta distribution is a “conjugate prior” to the Bernoulli distribution.

This is a Bayesian term that means that the Beta distribution is the posterior predictive distribution of the probability of success contained in a Bernoulli trial.

Conjugate priors make the math of Bayesian analysis easy — below we can update the Beta by treating alpha as the count of successes and beta as the count of failures.

import numpy as npimport pandas as pdimport random#Initialize Sampling def initialize(arms): prior = pd.


ones((arms, 3)), columns = ['alpha', 'beta', 'theta']) prior.

loc[:, 'theta'] = prior.

loc[:, 'alpha'] / (prior.

loc[:, 'alpha'] + prior.

loc[:, 'beta']) post = pd.

DataFrame(columns = ['i', 'arm', 'payout']) return prior, post# e-Greedydef e_greedy(eta, arm_prob, prior): #Determine path-selection if np.


uniform(0, 1, 1) <= eta: s = np.


randint(0, len(arm_prob) -1, 1)[0] else: s = random.


where(prior['theta'] == np.

amax(prior['theta']))[0]) #Determine success/failure r = 1 if np.


uniform(0, 1, 1) <= arm_prob[s] else 0 prior.

loc[s, 'alpha'] += r prior.

loc[s, 'beta'] += 1- r prior.

loc[s, 'theta'] = prior.

loc[s, 'alpha'] / (prior.

loc[s, 'alpha'] + prior.

loc[s, 'beta']) return s, r, priorWe’ll try this code out head-to-head with Thompson sampling after a few words on that method.

Thompson SamplingThompson sampling takes a different approach to selecting the next arm to be pulled than ε-greedy.

A downside of ε-greedy is that it potentially gets stuck on local maximums — sub-par distributions that perform well through statistical chance.

The eta parameter is designed to counteract this by forcing the algorithm to take what appears in current view to be a sub-optimal choice (i.


not the current maximum predicted probability), but which may in fact be the optimal choice given enough information.

To do this, Thompson sampling draws from the posterior predictive distributions of each choice using a random uniform variable.

This allows a non-optimal distribution to be sampled with varying frequency — as the posterior distribution becomes more certain, the probability of the choice being made decreases dynamically, thus Thompson sampling dynamically balances the desire for more information with making the currently optimal choice.

For example, in the graph below we assume two choices, where blue has had 4 successes and 1 failure ,and orange has had the reserves.

While the blue choice has a mean ~.

7 and the orange choice has a mean ~0.

3, there is a non-trivial probability (~4%) that orange will be chosen instead of blue.

Our implementation of Thompson sampling looks like this — note that the random.

choice selection now uses a draw from the prior distributions rather than the prior means.

# Thompson Samplingdef thompson(arm_prob, prior): #Determine path-selection s = [np.



loc[n, 'alpha'], prior.

loc[n, 'beta'], 1) for n in range(len(arm_prob))] s = random.


where(s == np.

amax(s))[0]) #Determine success/failure r = 1 if np.


uniform(0, 1, 1) <= arm_prob[s] else 0 prior.

loc[s, 'alpha'] += r prior.

loc[s, 'beta'] += 1- r prior.

loc[s, 'theta'] = prior.

loc[s, 'alpha'] / (prior.

loc[s, 'alpha'] + prior.

loc[s, 'beta']) return s, r, priorThe Face-Off!Pitting the two algorithms against each-other helps clarify their strengths and weaknesses.

In the example below we simulate 5 arms with probabilities drawn from a Beta(1.

5, 1.

1), i.


shifted towards lower probabilities, but still maintaining sufficient opportunity for higher probabilities.

The graph below shows one trial run that is characteristic of these methods.

After 250 iterations both methods have struck upon the optimal lever (Lever 1), however the TS has pulled ahead due to superior performance earlier in the analysis.

After 50 iterations both methods were virtually tied with 30 wins, with EG leading slightly up to that point.

However after that point TS pulls ahead.

EG had started pulling Lever 2 (by change), and saw a success, and continued pulling, to the point that in the first 15 draws it had seen 10 successes (all while EG is assuming other untried levers have probabilities of 0.


EG continues heavily favoring Lever 2 until the 175th iteration, at which point it tries Lever 1 and finds long-term success.

In contrast EG had found Lever 0 as early as the 20th iteration, and pulls it nearly exclusively as early as the 40th iteration.

Bernoulli Multi-Armed Bandit with Lever Probabilities = 0.

86, 0.

44, 0.

56, 0.

34, and 0.


EG using eta = 0.


Each of these algorithms can be tuned — we can adjust the priors in both models, as well as adjusting eta for EG.

In the above example we used a 5% eta.

Using a more aggressive eta allows the model to be less prone to local maxima, while also potentially reducing long-term gains by enforcing sub-optimal choices.

The graph below looks at 1000 iterations using eta = 0.


As predicted we see that EG is even able to beat TS early on (while TS is looking to “prove” that Lever 0 is the right option, EG is willing to make that assumption without as much proof), however as the simulation goes into the 1000th iteration TS gains a slight lead, as EG is still sampling from non-optimal levers 10% of the time.

An improvement on EG could use a decay factor for eta.

Bernoulli Multi-Armed Bandit with Lever Probabilities = 0.

86, 0.

44, 0.

56, 0.

34, and 0.


EG using eta = 0.


The simulation above shows that both of these methods can be effective — they are largely dependent on context.

EG will perform well in situations where making a sub-optimal choice hurts less — if the true probability of the second or third choice is close enough to the true probability of the first choice, you won’t lose much by making those marginally sub-par choices.

Additionally if the search space is expansive, TS may get bogged down trying to prove the best option, while EG is willing to take quick wins without much “proof”.

In the example below, 50 arms are simulated using a Beta(3.

5, 0.

7) which favors giving arms a higher probability (see the histogram below).

As predicted, EG is able to pull out a quick-win while TS is searching the wide space that carries it through iteration 1000.

If we flip the probability distribution to favor low-probability outcomes-Beta(0.

7, 3.

5)-we see each algorithm struggle to find wins.

While the optimal choice wins 67% of the time, after 500 iterations, EG has won 229 times and TS slightly fewer.

However after 1000 iterations TS pulls ahead, a lead which it is predicted to maintain as EG continues to sample an ineffective space.

We see TS take a leg up again in the bi-modal distribution below (a 25/75 mixture of Beta(2.

5, 1) and Beta(2.


ConclusionThompson sampling is a Bayesian approach to the Multi-Armed Bandit problem that dynamically balances incorporating more information to produce more certain predicted probabilities of each lever with the need to maximize current wins.

ε-greedy is a simpler method that focuses more heavily on current wins with a deterministic approach to incorporating more information.

A consequence of this approach is that current wins and more information are sometimes not well-balanced (i.


concentrating on local maxima), but a benefit can be that if the method happens upon a global maximum early on it is more willing to trust it and can produce early-wins.

In future articles we’ll discuss expanding upon this framework using more complex Bayesian sampling techniques (in situations with non-conjugate priors), as well as applications to Thompson sampling to new problems: non-binomial multi-armed bandit, path optimization, and more.


. More details

Leave a Reply