Firstly, using TD(0) appears unfair to some states, for example person D, who, at this stage, has gained nothing from the paper reaching the bin two out of three times.

Their update has only been affected by the value of the next state, but this emphasises how the positive and negative rewards propagate outwards from the corner towards the states.

As we take more episodes the positive and negative terminal rewards will spread out further and further across all states.
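To make this propagation concrete, here is a minimal sketch of the TD(0) update, V(s) ← V(s) + alpha * (r + gamma * V(s') - V(s)), applied along whole episodes. The two paths and the +1 / -1 terminal rewards are hypothetical stand-ins rather than the exact episodes used above, and `td0_episode` is an illustrative helper name, not part of any library.

```python
from collections import defaultdict

def td0_episode(V, path, terminal_reward, alpha=0.5, gamma=0.5):
    """Update V in place with TD(0) along one episode.

    path is the ordered list of states visited; only the final step
    earns the terminal reward, and the terminal state has value 0.
    """
    for i, state in enumerate(path):
        is_last = i == len(path) - 1
        reward = terminal_reward if is_last else 0.0
        next_value = 0.0 if is_last else V[path[i + 1]]
        V[state] += alpha * (reward + gamma * next_value - V[state])
    return V

V = defaultdict(float)                      # every state starts at 0
positive_path = ["C", "D", "G", "Teacher"]  # hypothetical successful episode
negative_path = ["B", "M"]                  # hypothetical failed episode

td0_episode(V, positive_path, +1.0)
td0_episode(V, positive_path, +1.0)
td0_episode(V, negative_path, -1.0)
print(dict(V))
```

Under these assumptions, two successful episodes lift Teacher and G but leave D at zero, because each update only looks one step ahead. The reward needs one more episode to reach each further state.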

This is shown roughly in the diagram below where we can see that the two episodes that resulted in a positive result impact the value of states Teacher and G whereas the single negative episode has punished person M.

To show this, we can try more episodes.

If we repeat the same three paths already given we produce the following state value function:

(Please note, we have repeated these three episodes for simplicity in this example but the actual model would have episodes where the outcomes are based on the observed transition probability function.)

The diagram above shows the terminal rewards propagating outwards from the top right corner to the states.

From this, we may decide to update our policy as it is clear that the negative terminal reward passes through person M and therefore B and C are impacted negatively.

Therefore, based on V27, for each state we may decide to update our policy by selecting the next best state value for each state as shown in the figure below.

There are two causes for concern in this example: the first is that person A’s best action is to throw it into the bin and net a negative reward.

This is because none of the episodes have visited this person, which highlights the exploration problem familiar from the multi-armed bandit problem.

In this small example there are very few states, yet it would still require many episodes to visit them all, and we need to ensure this is done.

This action looks better for this person because neither of the terminal states has a value; the positive and negative outcomes are instead held in the terminal rewards.

We could then, if our situation required it, initialise V0 with figures for the terminal states based on the outcomes.
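As a rough illustration of this first concern, the sketch below picks, for each state, the reachable next state with the highest value. The state values, the neighbour list and the terminal state name "Miss" are all hypothetical. They are chosen only to show how an unvisited terminal state with a default value of 0 can beat a slightly negative neighbour, and how initialising V0 for that terminal state changes the choice.

```python
def greedy_policy(V, neighbours):
    """For each state, choose the reachable next state with the highest value.

    States missing from V are treated as unvisited and valued at 0.
    """
    return {
        state: max(options, key=lambda nxt: V.get(nxt, 0.0))
        for state, options in neighbours.items()
    }

# Hypothetical values after some episodes; A's options are B or a throw that misses.
V = {"B": -0.1, "M": -0.5}
neighbours = {"A": ["B", "Miss"]}

policy_default = greedy_policy(V, neighbours)

# Same values, but with V0 initialised for the terminal state (assumed figure).
V_initialised = {**V, "Miss": -1.0}
policy_initialised = greedy_policy(V_initialised, neighbours)

print(policy_default, policy_initialised)
```

With the terminal state left at 0 the greedy choice for A is the throw. Once the terminal state carries the outcome in V0, passing to B looks better.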

Secondly, the state value of person M is flipping back and forth between -0.03 and -0.51 (approx.) after the episodes and we need to address why this is happening.

This is caused by our learning rate, alpha.

So far, we have only introduced our parameters (the learning rate alpha and discount rate gamma) but have not explained in detail how they impact results.

A large learning rate may cause the results to oscillate, but conversely it should not be so small that it takes forever to converge.

This is shown further in the figure below, which plots the total V(s) for every episode: although there is a general increasing trend, it oscillates back and forth between episodes.
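The effect of alpha on this oscillation can be sketched with a single state whose update target alternates between two values. The targets of -1 and 0 are assumptions chosen for illustration, not the figures from the episodes above, and `steady_state_swing` is a hypothetical helper.

```python
def steady_state_swing(alpha, targets=(-1.0, 0.0), steps=200):
    """Repeatedly move a value towards alternating targets and return
    how far it still swings once the updates have settled."""
    value = 0.0
    history = []
    for i in range(steps):
        target = targets[i % len(targets)]
        value += alpha * (target - value)  # same form as the TD update
        history.append(value)
    tail = history[-10:]                   # last few updates only
    return max(tail) - min(tail)

swing_large_alpha = steady_state_swing(0.5)
swing_small_alpha = steady_state_swing(0.2)
print(swing_large_alpha, swing_small_alpha)
```

In this toy setting the value keeps swinging by roughly 0.33 with alpha = 0.5 but only by roughly 0.11 with alpha = 0.2, which is the smoothing effect described here.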

Another good explanation for learning rate is as follows:“In the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole.

Later when he reaches the flagged area, he chooses a different stick to get accurate short shot.

So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times.

But it would be best if he plays optimally and uses the right amount of power to reach the hole.”

(Source: “Learning rate of a Q learning agent”, stackoverflow.com)

There are some complex methods for establishing the optimal learning rate for a problem but, as with any machine learning algorithm, if the environment is simple enough you can iterate over different values until convergence is reached.

This is a simple form of hyperparameter tuning; the learning rate plays the same step-size role here as it does in stochastic gradient descent.

In a recent RL project, I demonstrated the impact of reducing alpha using an animated visual and this is shown below.

This demonstrates the oscillation when alpha is large and how this becomes smoothed as alpha is reduced.

Likewise, our discount rate must be a number between 0 and 1; it is often taken to be close to 0.9.

The discount factor tells us how important rewards in the future are; a large number indicates that they will be considered important whereas moving this towards 0 will make the model consider future steps less and less.
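A quick way to see this is to compute the discounted value of a reward that only arrives a few steps in the future. The reward sequence below is a made-up example, and `discounted_return` is an illustrative helper.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    return sum(gamma ** t * reward for t, reward in enumerate(rewards))

# Hypothetical episode: nothing for three steps, then a reward of 1.
rewards = [0.0, 0.0, 0.0, 1.0]

value_far_sighted = discounted_return(rewards, gamma=0.9)
value_short_sighted = discounted_return(rewards, gamma=0.5)
print(value_far_sighted, value_short_sighted)
```

With gamma = 0.9 the delayed reward is still worth about 0.73 at the start, whereas with gamma = 0.5 it is worth only 0.125, so the model pays far less attention to it.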

With both of these in mind, we can change alpha from 0.5 to 0.2 and gamma from 0.5 to 0.9, and we achieve the following results:

Because our learning rate is now much smaller the model takes longer to learn and the values are generally smaller.

The teacher stands out most noticeably and is clearly the best state.

However, this trade-off for increased computation time means our value for M is no longer oscillating to the degree it was before.

We can now see this in the diagram below for the sum of V(s) following our updated parameters.

Although it is not perfectly smooth, the total V(s) slowly increases at a much smoother rate than before and appears to converge as we would like but requires approximately 75 episodes to do so.

Changing the Goal Outcome

Another crucial advantage of RL that we haven’t mentioned in too much detail is that we have some control over the environment.

Currently, the rewards are based on what we decided would be best to get the model to reach the positive outcome in as few steps as possible.

However, say the teacher changed and the new one didn’t mind the students throwing the paper in the bin so long as it reached it.

Then we can change our negative reward around this and the optimal policy will change.
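As a small sketch of this idea, the comparison below weighs throwing the paper straight into the bin against passing it to a neighbour. The neighbour value of 0.4, the rewards of -1 and +1, and the action names are all assumptions made for illustration.

```python
def best_action(throw_reward, neighbour_value, gamma=0.9):
    """Compare throwing now (immediate terminal reward) with passing
    the paper on (discounted value of the neighbouring state)."""
    action_values = {
        "throw_in_bin": throw_reward,
        "pass_on": gamma * neighbour_value,
    }
    return max(action_values, key=action_values.get)

# Hypothetical: the original teacher punishes a direct throw...
choice_strict_teacher = best_action(throw_reward=-1.0, neighbour_value=0.4)
# ...whereas the new teacher rewards any paper that reaches the bin.
choice_relaxed_teacher = best_action(throw_reward=1.0, neighbour_value=0.4)
print(choice_strict_teacher, choice_relaxed_teacher)
```

Only the reward changed between the two calls, yet the preferred action flips from passing the paper on to throwing it directly.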

This is particularly useful for business solutions.

For example, say you are planning a strategy and know that certain transitions are less desired than others, then this can be taken into account and changed at will.

Conclusion

We have now created a simple Reinforcement Learning model from observed data.

There are many things that could be improved or taken further, including using a more complex model, but this should be a good introduction for those that wish to try and apply to their own real-life problems.

I hope you enjoyed reading this article. If you have any questions, please feel free to comment below.

Thanks,

Sterling