Eligibility Traces in Reinforcement Learning

Ziad SALLOUM

Sometimes looking backward isn't that bad.

What Are Eligibility Traces?

In short and in a straightforward manner, Eligibility Traces are a kind of mathematical trick that improves the performance of Temporal Difference methods in Reinforcement Learning.

Here are the benefits of Eligibility Traces:

- They provide a way of implementing Monte Carlo in an online fashion (no need to wait for the episode to finish) and on problems without episodes.
- They provide an algorithmic mechanism that uses a short-term memory vector.
- They are computationally efficient, storing a single trace vector instead of a list of feature vectors.
- Learning is done continually rather than waiting for results at the end of an episode.

The Forward View

Remember that Temporal Difference and Monte Carlo methods update a state based on future rewards.

This is done either by looking one step ahead or by waiting for the episode to finish.

This approach is called the Forward View.

In the Forward View we look ahead n steps for future rewards. In TD(0) we look one step ahead, while in Monte Carlo we look ahead until the episode terminates and collect the discounted rewards.

However, there is a middle ground in which we look n steps ahead.

The n-step Forward View

As explained in the previous section, looking ahead can vary from one step to the end of the episode, as in the case of Monte Carlo.

So, n-steps is some kind of middle ground.

Remember that in Monte Carlo we execute the episodes, get their returns Gi, and average those returns to compute the state value.

Note that the length (number of steps) of each episode may vary from one episode to another.

It is not constant! Similarly, we will do the same with the n-step look-ahead.

As in Monte Carlo, the number of steps is not necessarily the same in each episode.

NB: in this section we no longer refer to an episode as a number of steps that must terminate by reaching a terminal state, but simply as a number of steps (the terminal state is no longer a requirement).
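To make the n-step look-ahead concrete, here is a minimal sketch of the standard n-step return, written G(t, t+n) below: it sums the first n discounted rewards after time t and then bootstraps on the current value estimate of the state reached after n steps. The function name, the list layout and the indexing convention are illustrative assumptions, not from the original article.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G(t, t+n).

    Assumed conventions (for this sketch only):
    rewards[k] is the reward R(k+1) received when leaving step k,
    values[k] is the current estimate V(Sk)."""
    # Sum the first n rewards after time t, discounted by gamma
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    # Bootstrap: add the discounted current estimate of the state reached after n steps
    G += gamma ** n * values[t + n]
    return G
```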

So let's define an average return over all these look-aheads as follows:

G(λ, t) = (1 - λ) Σ_{n=1}^{∞} λ^(n-1) G(t, t+n)

where G(λ, t) is the weighted average of all the returns G(t, t+n), each of which is the return of an individual episode that starts at t and ends at t+n, for n going from 1 to infinity.

λ is a weight that has a value in [0, 1].

As in any weighted average, the sum of the weights must be one, which is the case since

(1 - λ) Σ_{n=1}^{∞} λ^(n-1) = (1 - λ) · 1/(1 - λ) = 1

Important Remark: it is easy to notice that for episodes with large n, λ to the power n becomes small, so the contribution of G(t, t+n) will be small.
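As a quick numerical sanity check of that claim (a tiny sketch; the λ value and the truncation length are arbitrary choices):

```python
lam = 0.9                       # an arbitrary lambda in [0, 1)
num_terms = 1000                # truncate the infinite sum for the check
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, num_terms + 1)]
print(sum(weights))             # ~= 1.0 (exactly 1 in the limit of infinitely many terms)
```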

But hey! We already know that from the use of the discount factor gamma (ɣ) that appears in the return definition:

Gt = R(t+1) + ɣ·R(t+2) + ɣ²·R(t+3) + …

But bear in mind this is a different issue: in Monte Carlo, for example, V(s) is the average of all returns Gi computed from the episodes that have been played.

All Gi contributed equally to the value of V(s), even though the rewards are discounted depending on their distance from state s.

Here, the issue is different: the Gi do not all contribute the same way to V(s); they are weighted, and each weight gets smaller as the number of steps in the episode grows.

Picture how each episode is weighted according to its length: the weight becomes smaller as the time (or the number of steps n) increases.

In short, if an episode terminates after 3 steps, the weight associated with its return is far greater than that of an episode that terminates after T steps (where T is much greater than 3).

It is also important to notice that the weight decreases exponentially.

Let's rewrite the return for the case where the episode terminates at time T:

G(λ, t) = (1 - λ) Σ_{n=1}^{T-t-1} λ^(n-1) G(t, t+n) + λ^(T-t-1) Gt

Note that every look-ahead that reaches beyond time step T simply yields Gt (the return observed at termination), because there are no more states after T, so all those terms collapse into the single λ^(T-t-1) Gt term.

When λ is 0, the expression of G(λ, t) reduces to the formula of TD(0) (note that zero to the power zero is 1 by convention in this context; see the Wikipedia article on the subject for more info), and when λ is 1, G(λ, t) becomes the formula of Monte Carlo.
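Building on the n_step_return sketch above, here is a minimal, illustrative implementation of this truncated λ-return. It assumes that values carries one extra entry of 0.0 for the terminal state, so the last term is the plain Monte Carlo return; the sample numbers are made up for the demonstration.

```python
def lambda_return(rewards, values, t, gamma, lam):
    """Truncated lambda-return G(lam, t) for an episode that ends at T = len(rewards).
    Uses n_step_return from the earlier sketch; values[-1] must be 0.0 (terminal)."""
    T = len(rewards)
    # Weighted n-step returns for n = 1 .. T - t - 1
    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
                for n in range(1, T - t))
    # The remaining weight goes to the full (Monte Carlo) return Gt
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam

rewards = [1.0, 0.0, 2.0]            # R(1), R(2), R(3); the episode ends at T = 3
values  = [0.5, 0.4, 0.3, 0.0]       # V(S0), V(S1), V(S2) and the terminal value 0
print(lambda_return(rewards, values, t=0, gamma=0.9, lam=0.0))  # 1.36 = the TD(0) target
print(lambda_return(rewards, values, t=0, gamma=0.9, lam=1.0))  # 2.62 = the Monte Carlo return
```

With lam = 0 only the one-step term survives, and with lam = 1 only the full return survives, which is exactly the TD(0) and Monte Carlo special cases described above.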

The Problem Is Not Resolved Yet!

All these details, and still we didn't solve the problem… Forward views are complex to implement because the update of each state depends on later events and rewards that are not available at the current time.

But so far we have been doing just that! We are looking n steps ahead… However, this is going to change by adopting a new approach: the Backward View.

Backward View TD(λ)

Suppose an agent randomly walking in an environment finds a treasure.

He then stops and looks backwards in an attempt to know what led him to this treasure. Naturally, the steps that are close to the treasure have more merit in finding it than the steps that are miles away.

So closer locations are more valuable than distant ones and are thus assigned bigger values. The way this materializes is through a vector E called the eligibility traces.

Concretely, the eligibility trace is a function of the state, E(s), or of the state-action pair, E(s, a), and holds a decaying eligibility value for each state: how much credit that state should receive for the current error.

So how do we move from the Forward View to the Backward View, and what is the role of eligibility traces in that? Remember what we said about the Forward View: the contribution of each episode to the current state is attenuated exponentially (λ to the power n) according to the number of steps (n) in the episode.

Using the same logic, when we are at state s, instead of looking ahead to see the decaying return (Gt) of an episode coming towards us, we simply take the value we have and throw it backward using the same decaying mechanism.

For example, in TD(0) we defined the TD error as:

δt = R(t+1) + ɣ·V(St+1) - V(St)

This error will be propagated backwards, but in a decaying manner.

Similar to a voice that fades away with distance.

The way we implement this is by multiplying the error δ by the eligibility trace at each state.

where Et(s) is updated as follows:

Et(s) = ɣ·λ·Et-1(s) + 1(St = s)

The Backward View propagates the error δ to previous states. The notation 1(St = s) means that we assign the full value when we are at state s, and as the error gets propagated backwards it gets attenuated exponentially.

The eligibility trace update starts with E(s) = 0 for all states; then, as we pass through each state (by performing an action), we increment its E(s) to boost its eligibility, and at every step we decay E(s) by ɣλ (E(s) = ɣ·λ·E(s)) for all s.

The main advantage of eligibility traces over the n-step forward view is that only a single trace vector is required rather than a store of the last n feature vectors, as the sketch below illustrates.
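To make the backward view concrete, here is a minimal sketch of tabular backward-view TD(λ) with accumulating traces. The env.reset()/env.step() interface, the policy callable and the parameter values are illustrative assumptions rather than anything prescribed by the article.

```python
from collections import defaultdict

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.9, lam=0.8):
    """One episode of backward-view TD(lambda) with accumulating traces.

    V is a defaultdict(float) (or a dict containing every state) and is updated
    in place. Assumes env.reset() returns the initial state and env.step(action)
    returns (next_state, reward, done); this interface is an assumption."""
    E = defaultdict(float)               # eligibility traces, E(s) = 0 for all states
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)

        # TD error: delta = R(t+1) + gamma * V(St+1) - V(St)
        target = reward + (0.0 if done else gamma * V[next_state])
        delta = target - V[state]

        # Boost the trace of the state we just left
        E[state] += 1.0

        # Propagate the error backwards: each state is updated in proportion to
        # its eligibility, then every trace decays by gamma * lambda
        for s in list(E.keys()):
            V[s] += alpha * delta * E[s]
            E[s] *= gamma * lam

        state = next_state
    return V
```

With lam = 0 every update collapses to one-step TD(0), while with lam = 1 the updates approximate the Monte Carlo ones, mirroring the forward-view special cases above.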

Algorithms Using Eligibility Traces

A few well-known algorithms that use Eligibility Traces are TD(λ), Sarsa(λ) and Watkins's Q(λ).

Conclusion

Eligibility traces are a way of weighting between temporal-difference "targets" and Monte Carlo "returns".

Meaning that instead of using the one-step TD target, we use the TD(λ) target.

In other words, it fine-tunes the target to achieve better learning performance.
