Question

I think I get the main idea, and I almost understand the derivation except for this one line; see the picture below:

[Image: a line from the derivation of the 2-step backup target, from pg. 160 of the PDF linked in the EDIT below]

I understand why we use the policy probability to weight the rewards from time t + 2 onward (reaching them depends on the probability of taking the action that gets us there). But I don't understand why we similarly subtract the value function from the return...

It also doesn't seem to match the example target return (G) implied for the 2-step backup on slide 15 of this lecture's slides:

https://www.dropbox.com/sh/3xowt7qvyadvejn/AABpWQMKWX3KVbeqVlBcxNYra/slides%20(pdf%20and%20keynote)?dl=0&preview=13-multistep.pdf

Thanks for any insight. I could be missing something simple/obvious as I dive into these details.

EDIT - for more context, see pg. 160 of this PDF, which is where the picture comes from: http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf


Solution

The slides and the book are consistent. Notice that in the slides the summation carries a restriction: $a \neq A_{t+1}$. For $G^{(2)}$, you need to "remove" from $V_{t+1}$ (the value estimate at $S_{t+1}$) the term that should not be there, i.e. the one corresponding to $A_{t+1}$.
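To make the equivalence explicit, here is a sketch of the identity at work (I drop the time indices on the estimates for readability and write $V(s) = \sum_a \pi(a \mid s)\, Q(s, a)$, the expected approximate value; the draft's exact subscripts may differ):

$$\sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) \;=\; V(S_{t+1}) \;-\; \pi(A_{t+1} \mid S_{t+1})\, Q(S_{t+1}, A_{t+1})$$

Both sides are the same quantity: the slides write the restricted sum directly, while the book writes the full expectation and subtracts the sampled action's term.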

Now, why is this term removed?

If you kept this term, you would be counting $A_{t+1}$ twice: in the 1-step backup, its Q-value is already part of the expectation over actions at $S_{t+1}$.

When you calculate the 2-step backup, you want to replace the sampled pair $(S_{t+1}, A_{t+1})$ in the 1-step expectation with the discounted expected value of $S_{t+2}$. So you subtract that term and add the discounted expectation for $S_{t+2}$.
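Putting the two steps together gives the 2-step target (again a sketch with time indices dropped, so the draft's exact subscripts may differ):

$$G^{(2)} = R_{t+1} + \gamma V(S_{t+1}) + \gamma\, \pi(A_{t+1} \mid S_{t+1}) \big( R_{t+2} + \gamma V(S_{t+2}) - Q(S_{t+1}, A_{t+1}) \big)$$

The $-Q(S_{t+1}, A_{t+1})$ term strips the sampled action's contribution out of $V(S_{t+1})$, and $R_{t+2} + \gamma V(S_{t+2})$ is the discounted 2-step continuation that replaces it.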

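As a quick numerical sanity check (a minimal sketch with made-up toy numbers; none of the values or variable names come from the book or the slides), the restricted-sum form from the slides and the subtract-and-replace form from the book produce the same target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quantities for one transition: 3 actions, random Q-values and policy.
n_actions = 3
gamma = 0.9
Q1 = rng.normal(size=n_actions)          # Q(S_{t+1}, .)
pi1 = rng.dirichlet(np.ones(n_actions))  # pi(. | S_{t+1})
R1, R2 = rng.normal(size=2)              # R_{t+1}, R_{t+2}
V2 = rng.normal()                        # V(S_{t+2})
a1 = 1                                   # index of the sampled action A_{t+1}

V1 = pi1 @ Q1                            # V(S_{t+1}) = sum_a pi(a) Q(a)

# Slides' form: sum restricted to a != A_{t+1}, plus the sampled branch.
mask = np.arange(n_actions) != a1
G_slides = R1 + gamma * (pi1[mask] @ Q1[mask]) \
              + gamma * pi1[a1] * (R2 + gamma * V2)

# Book's form: full expectation V(S_{t+1}), with the sampled action's term
# subtracted and replaced by its discounted 2-step continuation.
G_book = R1 + gamma * V1 \
            + gamma * pi1[a1] * (R2 + gamma * V2 - Q1[a1])

assert np.isclose(G_slides, G_book)  # the two targets coincide
```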