Reinforcement Learning — Multi-Arm Bandit Implementation

Reinforcement Learning — Multi-Arm Bandit ImplementationJeremy zhangBlockedUnblockFollowFollowingMay 25Multi-Arm Bandit is a classic reinforcement learning problem, in which a player is facing with k slot machines or bandits, each with a different reward distribution, and the player is trying to maximise his cumulative reward based on trials.

k-armed banditFormulationLet’s strike into the problem directly.

There are 3 key components in a reinforcement learning problem — state, action and reward.

Let’s recall the problem — k machines are placed in front of you, and in each episode, you choose one machine and pull the handle, by taking this action, you get a reward accordingly.

So the state is the current estimation of all the machines, which is zeros for all in the beginning, the action is the machine you decide to choose at each episode, and the reward is the result or payout after you pull the handle.

With these 3 items in our mind, we are able to write the init function of Bandit (check out code here):k is the number of bandits we hope to initialise, actions are represented as numbers, which each stands for the machine we are going to pick up, and we have exp_rate (exporation rate) and lr (learning rate) as usual to control exploration and value updating.

total_reward and avg_reward are used to measure our result.

The payout of each machine is a random value generated from normal distribution, and finally the initial estimates of values are set to 0.

ActionsChoosing ActionTo identify the machine with the most reward, the straight forward way is to try, try as many times as possible until one has a certain confidence with each machine, and stick to the best estimated machine from onward.

The thing is we probably can test in a wiser way.

As our goal is to get the maximum reward along the way, we surely should not waste to much time on a machine that always gives a low reward, and on the other hand, even we come across a palatable reward of a certain machine, we should still be able to explore other machines in the hope of some not-explored-enough machines could give us a higher reward.

So when it comes to choosing actions, we will explore to methods:ϵ-greedyUCB (Upper-Confidence-Bound)We have talked about ϵ-greedy many times, where we take random actions with ϵ probability and maximum action with 1 – ϵ probability.

This makes a good balance between exploration and exploitation.

Let’s talk a bit on UCB which is based on the formula below:UCBwhere lnt denotes the natural logarithm of t(the number that e = 2.

71828would have to be raised to in order to equal t), Nt(a) denotes the number of times that action a has been selected prior to time t(the denominator in the formula, and the number c > 0controls the degree of exploration.

If Nt(a)=0, then a is considered to be a maximising action.

(From Reinforcement Learning: An Introduction)So as the number of trials increase, if a bandit has not yet been explored much, its Nt(a) will be small, thus the right side of the formula will be large.

In this way, the formula helps to balance between the current estimation Q and exploration rate.

To implement UCB, we need to add some variables into the init function:The self.

times counts the total number of trials and self.

action_times counts the number of actions for each bandit.

Now we can put both ϵ-greedy and UCB into our method choosing function:When self.

ucb is turned on, we will choose action based on the formula.

The self.

action_times is added by 0.

1 in case value divided by 0.

If exp_rate if set to 0 and self.

ucb is true, the action will be chosen purely by UCB, but if exp_rate > 0 , then we will have a hybrid method that mixes ϵ-greedy and UCB.

Taking Action and Updating EstimatesThe takeAction function receives an action and updates estimates of that action after receiving a reward.

As the problem described, each bandit’s payout conforms to a certain distribution, thus we add some randomness to each true reward.

After receiving the reward, the estimates will be updated by adding the difference between the current observation and previous estimation multiplied by a learning rate.

PlayAnd finally, we need a function to start playing.

ResultWe init a 5-arm bandit and compare the average gain among different methods.

(check out code here)This is the average reward over 2000 iterations using ϵ-greedy method with different exploration rate.

The 0.

1 exploration rate surpass the 0.

3 one when it converges.

We add UCB method with different c values into comparison.

Both the UCB method have exploration rate 0.


From the result, we see that in general, UCB outperforms ϵ-greedy, but it is also in general hard to apply UCB on many other problems beside bandit.

Please check out full code here, and you are welcomed to contribute and raise issues!Reference:http://incompleteideas.




. More details

Leave a Reply