Question

I'm trying to implement eligibility traces (forward looking), whose pseudocode can be found in the following image

[image: pseudocode of the algorithm]

I'm uncertain what the For all s, a means (5th line from below). Where do they get that collection of s, a from?

If it's forward-looking, do you loop forward from the current state to observe s'?

Do you adjust every single e(s, a)?


Solution

It's unfortunate that they've reused the variables s and a in two different scopes here, but yes, you adjust all e(s,a) values, e.g.,

for every state s in your state space
    for every action a in your action space
        update Q(s,a)
        update e(s,a)
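
As a concrete sketch of that loop: the image isn't reproduced here, so this assumes the pseudocode is the usual tabular Sarsa(lambda) with accumulating traces. The update rules, the function name `sarsa_lambda_step`, and the parameters `alpha`, `gamma` and `lam` are my assumptions, not necessarily what your source uses.

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
    """One step of tabular Sarsa(lambda) with accumulating traces (assumed form)."""
    # TD error for the transition that was just taken
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    # Bump the trace of the single pair just visited
    e[s, a] += 1.0
    # "For all s, a": sweep every entry of the table, not just the visited one
    n_states, n_actions = Q.shape
    for si in range(n_states):
        for ai in range(n_actions):
            Q[si, ai] += alpha * delta * e[si, ai]   # update Q(s,a)
            e[si, ai] *= gamma * lam                 # decay e(s,a)
    return delta

# Tiny made-up problem: 3 states, 2 actions, everything starts at zero
Q = np.zeros((3, 2))
e = np.zeros((3, 2))
delta = sarsa_lambda_step(Q, e, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

So the collection of s, a is simply every index of your Q table. With NumPy arrays the two nested loops can be replaced by `Q += alpha * delta * e` followed by `e *= gamma * lam`, which does the same thing.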

Note what's happening here. Inside the loop, every e(s,a) is multiplied by a decay factor, so it shrinks exponentially with the time since that pair was last visited. But right before you go into that loop, you increment the single e(s,a) corresponding to the state/action pair just visited. So that pair gets "reset" in a way: its trace is bumped back up before the decay is applied, and on the next iteration, its update will continue to be larger than those of all the pairs you haven't recently visited. Every time you visit a state/action pair, you're increasing the weight it contributes to the update of Q for a few iterations.
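
To see the "reset" effect in numbers, here is a small sketch that tracks only the traces. It again assumes accumulating traces that decay by gamma * lambda on each step. The pair labels, the visit sequence and the values 0.9 and 0.8 are made up for illustration.

```python
gamma, lam = 0.9, 0.8          # assumed discount and trace-decay parameters
decay = gamma * lam            # every trace shrinks by this factor each step

# Hypothetical visit sequence over three state/action pairs
visits = ["A", "B", "A", "C", "C"]
traces = {"A": 0.0, "B": 0.0, "C": 0.0}
history = []

for pair in visits:
    traces[pair] += 1.0              # bump the pair just visited
    history.append(dict(traces))     # trace values that weight this step's Q update
    for k in traces:                 # "for all s, a": decay every trace
        traces[k] *= decay

for step, snapshot in enumerate(history):
    print(step, {k: round(v, 3) for k, v in snapshot.items()})
```

By the last step, C (visited on the two most recent steps) has the largest trace, A (visited twice, but earlier) comes next, and B (visited once, early on) has decayed the most.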

Licensed under: CC-BY-SA with attribution