# Nuts and Bolts of Reinforcement Learning: Introduction to Temporal Difference (TD) Learning

The learning rate (α), also called the step size, controls how far each update moves the estimate toward its target; keeping it small helps the estimates converge.

Since we take the difference between the target value and the predicted value, this quantity behaves like an error.

We call it the TD error.

Notice that the TD error at each time is the error in the estimate made at that time.

Because the TD error depends on the next state and next reward, it is not actually available until one timestep later.

Iteratively, we will try to minimize this error.
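For reference, the TD error described above is conventionally written as follows. The time-indexed notation ($S_t$ for the state, $R_{t+1}$ for the reward, $\gamma$ for the discount factor) is the standard textbook convention and is assumed here rather than taken from this article:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Because $\delta_t$ involves $R_{t+1}$ and $S_{t+1}$, it can only be computed one timestep after the estimate $V(S_t)$ was made.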

## Understanding TD Prediction using the Frozen Lake Example

Let us understand TD prediction with the frozen lake example.

The frozen lake environment is shown next.

First, we initialize the value function to 0, that is, V(s) = 0 for all states, as shown in the following state-value diagram:

Say we are in the starting state (s) (1,1), take the action right, move to the next state (s') (1,2), and receive a reward (r) of -0.4.

How can we update the value of the state using this information? Recall the TD update equation:

$$V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]$$

Let us take the learning rate (α) as 0.1 and the discount factor (γ) as 0.5. We know that the value of the state (1,1), V(s), is 0 and the value of the next state (1,2), V(s'), is also 0. The reward (r) we obtained is -0.4. We substitute these into the TD rule as follows:

$$V(s) = 0 + 0.1\,[-0.4 + 0.5(0) - 0]$$

$$V(s) = -0.04$$

So, we update the value of the state (1,1) to -0.04 in the value table, as shown in the following diagram:

Now that we are in the state (s) (1,2), we take the action right, move to the next state (s') (1,3), and receive a reward (r) of -0.4.

How do we update the value of the state (1,2) now? We substitute the values into the TD update equation:

$$V(s) = 0 + 0.1\,[-0.4 + 0.5(0) - 0]$$

$$V(s) = -0.04$$

Go ahead and update the value of the state (1,2) to -0.04 in the value table:

With me so far? We are now in the state (s) (1,3).

Let’s take an action left.

We go back to the state (s') (1,2) and receive a reward (r) of -0.4.

Here, the value of the state (1,3) is 0 and the value of the next state (1,2) is -0.04 in the value table.

Now we can update the value of the state (1,3) as follows:

$$V(s) = 0 + 0.1\,[-0.4 + 0.5(-0.04) - 0]$$

$$V(s) = 0.1\,[-0.42]$$

$$V(s) = -0.042$$

You know what to do now. Update the value of the state (1,3) to -0.042 in the value table:

We update the values of all the states in a similar fashion using the TD update rule.
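The three updates above can be replayed in a few lines of Python. This is a minimal sketch, not a full Frozen Lake implementation: the function name `td0_update` and the dictionary-based value table are illustrative assumptions, and it uses α = 0.1, γ = 0.5, and the reward of -0.4 that appears in the calculations.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.5):
    """Apply one TD update to V[s] in place and return the TD error."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


# Value table initialised to 0 for the three states visited in the walkthrough.
V = {(1, 1): 0.0, (1, 2): 0.0, (1, 3): 0.0}

# (state, reward, next_state) transitions from the worked example.
transitions = [
    ((1, 1), -0.4, (1, 2)),  # move right
    ((1, 2), -0.4, (1, 3)),  # move right
    ((1, 3), -0.4, (1, 2)),  # move left
]

for s, r, s_next in transitions:
    td0_update(V, s, r, s_next)

# Rounded for display; the values match the hand calculations.
print({s: round(v, 3) for s, v in V.items()})
```

Note that the third update already uses the revised value of state (1,2), which is why state (1,3) ends up slightly lower than the other two.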

To summarize, the steps involved in the TD prediction algorithm are:

1. First, initialize V(s) to 0 or some arbitrary value.
2. Then, begin the episode. For every step in the episode, perform an action a in the state s, receive a reward r, and move to the next state (s').
3. Update the value of the previous state using the TD update rule.
4. Repeat steps 2 and 3 until we reach the terminal state.

## Understanding Temporal Differencing Control

In Temporal Difference prediction, we estimated the value function.

In TD control, we optimize the value function.

There are two kinds of algorithms we use for TD control:

- Off-policy learning algorithm: Q-learning
- On-policy learning algorithm: SARSA

## Off-Policy vs On-Policy

What's the difference between off-policy and on-policy learning? The answer lies in their names:

- Off-policy learning: The agent learns about policy π from experience sampled from another policy µ
- On-policy learning: The agent learns about policy π from experience sampled from the same policy π

Let me break this down in the form of an example.

Let’s say you joined a new firm as a data scientist.

In this scenario, you can equate on-policy learning with learning on the job.

You’ll be trying different things and learning only from your own experience.

Off-policy learning would be where you have full access to the actions of another employee.

All you do in this scenario is learn from that employee’s experience and not repeat something that the employee has failed at.

## Q-learning

Q-learning is a very popular and widely used off-policy TD control algorithm.

In Q-learning, our concern is the state-action value pair: the effect of performing an action a in the state s.

This tells us how good an action is for the agent at a particular state, Q(s,a), rather than only how good it is to be in that state, V(s). We update the Q value based on the following equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Why is Q-learning considered an off-policy technique? Because it updates its Q-values using the Q-value of the next state s' and the greedy action a'.

In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy, max Q(s', a'), is followed from the next state onward, even though the agent itself is not following a greedy policy! The above equation is similar to the TD prediction update rule, with one subtle difference: the target uses the maximum Q-value over the next state's actions instead of V(s').
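To make the off-policy idea concrete, the sketch below contrasts the target that Q-learning bootstraps from with the one SARSA, the on-policy algorithm, would use. The SARSA target shown is the standard textbook form, which this article does not spell out; the function names, state label, and Q-values are hypothetical.

```python
def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the greedy (max-value) next action."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    return r + gamma * Q[(s_next, a_next)]


# Hypothetical Q-values for a single next state with two actions.
Q = {("s1", "left"): 0.2, ("s1", "right"): 0.6}
actions = ["left", "right"]

# Suppose the behaviour policy explores and picks "left" in s1.
off_policy = q_learning_target(Q, r=0.4, s_next="s1", actions=actions, gamma=1.0)
on_policy = sarsa_target(Q, r=0.4, s_next="s1", a_next="left", gamma=1.0)

print(off_policy, on_policy)
```

The Q-learning target ignores which action was actually chosen in the next state, while the SARSA target depends on it; the two coincide only when the chosen action happens to be the greedy one.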

Below are the steps involved in Q-learning (I want you to notice the difference here):

1. First, initialize the Q function to some arbitrary value.
2. Take an action from a state using the epsilon-greedy policy (ε) and move to the new state.
3. Update the Q value of the previous state by following the update rule.
4. Repeat steps 2 and 3 till we reach the terminal state.

Now, let's go back to our example of Frozen Lake.

Let us say we are in a state (3,2) and have two actions (left and right).

Refer to the figure below:

We select an action using the epsilon-greedy policy in Q-learning.

We either explore a new action with the probability epsilon or we select the best action with a probability 1 – epsilon.
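A minimal sketch of this selection rule is shown below. The function name `epsilon_greedy` and the dictionary layout are assumptions for illustration, and "explore" is implemented as a uniformly random choice over all actions, which is one common variant.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping action -> Q-value.

    With probability epsilon return a uniformly random action (explore);
    otherwise return the action with the highest Q-value (exploit).
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


# Hypothetical Q-values for one state.
q_values = {"up": 0.2, "down": 0.4, "right": 0.6}

greedy_choice = epsilon_greedy(q_values, epsilon=0.0)   # never explores
random_choice = epsilon_greedy(q_values, epsilon=1.0)   # always explores
print(greedy_choice, random_choice)
```

With `epsilon=0.0` the function always returns the highest-valued action; with `epsilon=1.0` it always samples at random.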

Suppose the exploration branch (probability epsilon) is chosen and we pick a particular action (moving down):

We have performed a downward action in the state (3,2) and reached a new state (4,2) using the epsilon-greedy policy.

How do we update the value of the previous state (3,2) using our update rule? It's pretty straightforward! Let us take alpha as 0.1, the discount factor as 1, and the reward as 0.4:

$$Q((3,2), \text{down}) = Q((3,2), \text{down}) + 0.1\,\big(0.4 + 1 \cdot \max_{a} Q((4,2), a) - Q((3,2), \text{down})\big)$$

We can say the value of the state (3,2) with the downward action is 0.6 in the Q table.

What is max Q((4,2), action) for the state (4,2)? We have explored three actions (up, down, and right), so we will take the maximum value based on these actions only.

There is no exploration involved here – this is a straightforward greedy policy.

Based on the previous Q table, we can plug in the values:

$$Q((3,2), \text{down}) = 0.6 + 0.1\,(0.4 + 1 \cdot \max[0.2, 0.4, 0.6] - 0.6)$$

$$Q((3,2), \text{down}) = 0.64$$

So, we update the value of Q((3,2), down) to 0.64.
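The same arithmetic can be checked with a small helper. This is a sketch under the walkthrough's settings (alpha 0.1, discount factor 1, reward 0.4); the function name `q_update` is an assumption for illustration.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=1.0):
    """Return the updated Q(s, a) after one Q-learning step."""
    td_target = reward + gamma * max(next_q_values)
    return q_sa + alpha * (td_target - q_sa)


# Q-values of the three explored actions in state (4,2), as used in the walkthrough.
next_q = [0.2, 0.4, 0.6]

new_q = q_update(q_sa=0.6, reward=0.4, next_q_values=next_q)
print(round(new_q, 2))
```

Only the largest of the next state's Q-values enters the update, regardless of which action the agent goes on to take there.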

Now, we are in the (4,2) state.

What action should we perform? Based on the epsilon-greedy policy, we can either explore a new action with a probability epsilon, or select the best action with a probability 1 – epsilon.

Let us say we select the latter option.

So, in (4,2), the action right has the maximum value, and that's what we'll select:

Right, we have moved to the state (4,3).

Things are shaping up nicely so far.

But wait, how do we update the value of the previous state?

$$Q((4,2), \text{right}) = Q((4,2), \text{right}) + 0.1\,\big(0.4 + 1 \cdot \max_{a} Q((4,3), a) - Q((4,2), \text{right})\big)$$

If you look at the Q table that follows, we have explored only two actions (up and down) for the state (4,3). So, we take the maximum value based only on these actions (we do not apply the epsilon-greedy policy here; we simply select the action that has the maximum value):

$$Q((4,2), \text{right}) = Q((4,2), \text{right}) + 0.1\,\big(0.4 + 1 \cdot \max[\,Q((4,3), \text{up}),\ Q((4,3), \text{down})\,] - Q((4,2), \text{right})\big)$$

$$Q((4,2), \text{right}) = 0.6 + 0.1\,(0.4 + 1 \cdot \max[0.2, 0.4] - 0.6)$$

$$= 0.6 + 0.1\,(0.4 + 1(0.4) - 0.6)$$

$$= 0.62$$

Awesome!
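Putting the Q-learning steps together, the loop below is a minimal tabular sketch. It does not use the Frozen Lake grid: the four-state corridor, its rewards, and all names (`step`, `q_learning`, `ACTIONS`) are hypothetical stand-ins chosen so the example is self-contained, and ties between equal Q-values are broken at random, which is a design choice of this sketch.

```python
import random

ACTIONS = ["left", "right"]
N_STATES = 4            # hypothetical corridor: states 0..3, state 3 is terminal
GOAL = N_STATES - 1


def step(state, action):
    """Hypothetical deterministic environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def epsilon_greedy(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily (random tie-break)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialise the Q function (here to 0 for every state-action pair).
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Step 2: act with the epsilon-greedy behaviour policy.
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Step 3: move Q(s, a) toward the greedy target (max over next actions).
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            # Step 4: continue from the new state until the terminal state.
            state = next_state
    return Q


Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)
```

The behaviour policy (epsilon-greedy) and the policy being learned about (greedy, via the `max` in the target) are different, which is exactly what makes this loop off-policy.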