If such a change suddenly prevents us from taking action 2, then action X becomes the wrong decision.
But there’s more to that than just paranoia.
Just a few lines ago we agreed that after choosing action Y we can be more flexible in the next action selection.
While there’s still a best option, the others are not too far behind it, which allows the model to explore these other actions more, as the price paid for not choosing the optimal one is low.
That cannot be said about the same scenario after choosing action X, and as we know, sufficient exploration is crucial for a robust Reinforcement Learning agent.
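To put some (completely made-up) numbers on that intuition, here’s a tiny sketch showing how the entropy of a softmax policy over the next state’s Q-values differs between the two cases; the Q-values below are hypothetical and only for illustration:

```python
import numpy as np

def softmax_entropy(q_values):
    """Entropy (in nats) of a softmax policy over the given Q-values."""
    p = np.exp(q_values - np.max(q_values))  # subtract max for numerical stability
    p /= p.sum()
    return -np.sum(p * np.log(p))

# Hypothetical next-state Q-values, not taken from any real agent:
print(softmax_entropy(np.array([1.0, 0.95, 0.90])))  # "after Y": ~1.10, near-uniform
print(softmax_entropy(np.array([1.0, -1.0, -2.0])))  # "after X": ~0.52, concentrated
```

The closer the Q-values are to each other, the closer the policy is to uniform, and the cheaper it is to explore.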
Let’s Talk Business

How to design a general policy that encourages an agent to maximize entropy is presented in the paper I linked to above.
Here I’d like to focus on the Soft Bellman Equation (discussed in the blogpost I referred to).
Let’s first refresh our memories with the regular Bellman Equation:

[Figure: The Bellman Equation]
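In text form (written in the common Q-learning style, with γ being the discount factor), the equation reads:

$$Q(s_t, a_t) = \mathbb{E}\left[\, r_t + \gamma \max_{a} Q(s_{t+1}, a) \,\right]$$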
The Soft Bellman Equation will try to maximize entropy rather than just the future reward. To do so, it replaces the last term, where we maximize over the future Q-Value, with an entropy-maximization term.
And so, in the case of a finite number of actions, the Soft Bellman Equation is:

[Figure: The Soft Bellman Equation]

If you’d like to see the mathematical proof of how this new term relates to the entropy, the blogpost authors claim it can be found in this 236-page Ph.D. thesis by Brian Ziebart.
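In the same notation (and with the temperature parameter set to 1), the soft version swaps the hard max for a log-sum-exp, sometimes called a “soft maximum”:

$$Q_{\text{soft}}(s_t, a_t) = \mathbb{E}\left[\, r_t + \gamma \log \sum_{a} \exp\big( Q_{\text{soft}}(s_{t+1}, a) \big) \,\right]$$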
If you’ve ever taken a Thermodynamics class, you can get the general idea if you recall that the thermodynamic entropy of a gas is defined as S = k⋅ln Ω, where the number of configurations at equilibrium, Ω, is roughly exp(N), with N being the number of particles.
If you have no idea what you’ve just read, you’ll just have to trust me (or read the thesis).
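To make the analogy a little more explicit (it’s an analogy only, not the proof): the new term is the logarithm of a sum of exponentials, while the thermodynamic entropy is, up to the constant k, the logarithm of an exponentially large count of configurations:

$$\log \sum_{a} \exp\big(Q(s, a)\big) \;\;\longleftrightarrow\;\; \frac{S}{k} = \ln \Omega, \qquad \Omega \approx e^{N}$$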
Does It Work?

If you’ve read some of my blogposts so far, you might have noticed that I like to test things myself, and this case is no different.
I once wrote about a Tic-Tac-Toe agent I trained using Deep Q-Networks, so I decided to modify it so it could also learn using the Soft Bellman Equation.
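To give a sense of how small the required change is, here’s a minimal sketch (plain NumPy, not my actual training code; the discount factor and temperature values are just placeholders) of how the bootstrap target differs between the two update rules:

```python
import numpy as np

def regular_target(reward, next_q_values, gamma=0.95, done=False):
    """Regular Bellman target: bootstrap with the maximal future Q-value."""
    if done:
        return reward
    return reward + gamma * np.max(next_q_values)

def soft_target(reward, next_q_values, gamma=0.95, temperature=1.0, done=False):
    """Soft Bellman target: bootstrap with a log-sum-exp ("soft maximum")
    of the future Q-values, which rewards keeping several actions viable."""
    if done:
        return reward
    soft_max = temperature * np.log(np.sum(np.exp(next_q_values / temperature)))
    return reward + gamma * soft_max
```

The rest of a standard DQN training loop (network, replay buffer, exploration) can stay as it is; only the target computation changes.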
I then trained two agents: a regular Q-Network agent and a Max-Entropy-Q-Network agent.
I trained one such pair playing against each other and another pair playing separately against an external player, and repeated this process 3 times, ending up with 6 different trained models of each type.
I then matched all regular Q-Network agents with all Max-Entropy-Q-Network agents to see which type of agent wins the most games.
I also forced the agents to select a different first move each game (to cover all possible opening moves), and made sure each of them got to play both X and O.
The results are very clear: of the 648 games played, the Soft Bellman agents won 36.1% of the games (234), 33.5% ended in a tie (217) and only 30.4% of the games (197) were won by the regular Q-Network agents.
When considering only the games in which I didn’t force a specific first move, but rather let the agents play as they wished, the results were even more in favor of the Soft Bellman agents: of the 72 games played, 40.3% (29) were won by the Max-Entropy-Q-Network agents, 33.3% (24) were won by the regular agents, and the remaining 26.4% (19) ended with no winner.
I encourage you to run this experiment yourself too!

Final Words

This experiment has demonstrated that when learning complex systems, and even not-so-complex ones, following objectives broader than just the highest reward can be quite beneficial.
As I see it, teaching a model such a broader policy is as if we no longer treat the agent as a pet we wish to train, but as a human we’re trying to teach.
I’m excited to see what more we can teach it in the future!