Introduction to Monte Carlo Tree Search: The Game-Changing Algorithm behind DeepMind’s AlphaGo

During a rollout, we randomly choose an action at each step and simulate it until we reach a terminal state; the value of that terminal state is the reward we record for the playout.

Loop forever:
    if Si is a terminal state:
        return Value(Si)
    Ai = random(available_actions(Si))
    Si = Simulate(Si, Ai)

This loop keeps simulating random actions until it reaches a terminal state.
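For concreteness, here is a minimal runnable sketch of the rollout phase in Python. The methods is_terminal(), available_actions(), simulate() and value() are a hypothetical game-state interface assumed for illustration, not taken from the implementations linked at the end.

import random

def rollout(state):
    # Keep simulating random actions until the game ends.
    while not state.is_terminal():
        action = random.choice(state.available_actions())
        state = state.simulate(action)   # assumed to return the successor state
    return state.value()                 # reward observed at the terminal state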

Flowchart for Monte Carlo Tree Search

Tree Traversal & Node Expansion

You start with S0, which is the initial state.

If the current node is not a leaf node, we calculate the UCB1 value of each child node and move to the child that maximises it.

We keep doing this until we reach a leaf node.
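The article never writes the UCB1 formula out explicitly, so here it is as a small Python helper. The exploration constant c = 2 matches the numbers used in the walkthrough below; sqrt(2) is another common choice.

import math

def ucb1(total_value, visits, parent_visits, c=2):
    # Unvisited children get an infinite score so they are always tried first.
    if visits == 0:
        return float("inf")
    # Average value (exploitation) plus an exploration bonus.
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)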

Next, we ask how many times this leaf node was sampled.

If it’s never been sampled before, we simply do a rollout (instead of expanding).

However, if it has been sampled before, then we add a new node (state) to the tree for each available action (which we are calling expansion here).

The current node then becomes one of these newly created nodes.

We then do a rollout from this node.
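Putting the selection and expansion rules together, a minimal sketch might look like this. The Node class (with t for total score and n for visit count) and the game-state methods are illustrative assumptions rather than the structures of the linked implementations; ucb1 and rollout are the helpers sketched above.

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.t = 0.0   # total score accumulated through this node
        self.n = 0     # number of visits

def traverse(node):
    # Selection: while not at a leaf, follow the child with the highest UCB1.
    while node.children:
        node = max(node.children, key=lambda c: ucb1(c.t, c.n, node.n))
    # Expansion: a leaf that has been sampled before gets one child per action.
    if node.n > 0 and not node.state.is_terminal():
        for action in node.state.available_actions():
            node.children.append(Node(node.state.simulate(action), parent=node))
        node = node.children[0]   # the rollout will start from a new child
    return node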

Complete Walkthrough with an example

Let’s do a complete walkthrough of the algorithm to truly ingrain this concept and understand it in a lucid manner.

Iteration 1:

We start with an initial state S0.

Here, we have actions a1 and a2 which lead to states S1 and S2, each with a total score t and number of visits n.

But how do we choose between the 2 child nodes?

Initial State

This is where we calculate the UCB1 values for both the child nodes and take whichever node maximises that value.

Since none of the nodes have been visited yet, the second term is infinite for both.

Hence, we are just going to take the first node (S1).

We are now at a leaf node, where we need to check whether we have visited it before.

As it turns out, we haven’t.

In this case, on the basis of the algorithm, we do a rollout all the way down to the terminal state.

Let’s say the value of this rollout is 20.

Rollout from S1

Now comes the 4th phase, the backpropagation phase.

The value of the leaf node (20) is backpropagated all the way to the root node.
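In code, backpropagation is just a walk from the simulated node up to the root, updating t and n at every node on the path. A minimal sketch, reusing the hypothetical Node class from above:

def backpropagate(node, value):
    # Add the rollout result to every node between the leaf and the root.
    while node is not None:
        node.n += 1
        node.t += value
        node = node.parent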

So now, t = 20 and n = 1 for nodes S1 and S0.

Post Backpropagation

That’s the end of the first iteration. The way MCTS works is that we run it for a defined number of iterations or until we are out of time.

This will tell us the best action to take at each step in order to get the maximum return.
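Tying the four phases together, the outer loop might look like the sketch below, reusing traverse, rollout and backpropagate from above. Recommending the child with the best average value is one common convention for the final move (picking the most-visited child is another); the article does not pin this choice down.

def mcts(root, n_iterations=1000):
    for _ in range(n_iterations):
        leaf = traverse(root)            # selection + expansion
        value = rollout(leaf.state)      # random playout to a terminal state
        backpropagate(leaf, value)       # update t and n along the path
    # After the budget is spent, recommend the child with the best average value.
    return max(root.children, key=lambda c: c.t / c.n if c.n else float("-inf"))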

Iteration 2:

We go back to the initial state and ask which child node to visit next.

Once again, we calculate the UCB values, which will be 20 + 2 * sqrt(ln(1)/1) = 20 for S1 and infinity for S2.

Since S2 has the higher value, we will choose that node.

A rollout is done at S2 and yields a value of 10, which is backpropagated to the root node.

The value at the root node is now 30.

Backpropagation from S2

Iteration 3:

In the below diagram, S1 has a higher UCB1 value, and hence the expansion should be done there:

Now at S1, we are in exactly the same position as at the initial state, with the UCB1 values for both child nodes being infinite.

We do a rollout from S3 and end up getting a value of 0 at the leaf node.

Iteration 4:

We again have to choose between S1 and S2.

The UCB1 value for S1 comes out to be 11.48 and 12.10 for S2.
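These numbers follow from the UCB1 helper above, with S1 at t = 20, n = 2, S2 at t = 10, n = 1, and the root having been visited 3 times:

import math

for name, t, n in [("S1", 20, 2), ("S2", 10, 1)]:
    score = t / n + 2 * math.sqrt(math.log(3) / n)
    print(name, round(score, 2))   # S1 -> 11.48, S2 -> 12.1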

We’ll do the expansion step at S2 since that’s our new current node.

On expansion, 2 new nodes are created — S5 and S6.

Since these are 2 new states, a rollout is done from the newly created node down to a terminal state to get a value, which is then backpropagated.

That is the gist of this algorithm.

We can perform as many iterations as required (or as is computationally feasible).

The underlying idea is that the value estimate at each node becomes more accurate as the number of iterations increases.

End Notes

DeepMind’s AlphaGo and AlphaGo Zero programs are far more complex, with various other facets that are outside the scope of this article.

However, the Monte Carlo Tree Search algorithm remains at the heart of both.

MCTS plays the primary role in making complex games like Go easier to crack in a finite amount of time.

Some open source implementations of MCTS are linked below:

Implementation in Python
Implementation in C++

I expect reinforcement learning to make a lot of headway in 2019.

It won’t be surprising to see a lot more complex games being cracked by machines soon.

This is a great time to learn reinforcement learning!

I would love to hear your thoughts and suggestions regarding this article and this algorithm in the comments section below.

Have you used this algorithm before? If not, which game would you want to try it out on?

Originally published at www.analyticsvidhya.com on January 23, 2019.
