Question

https://www.youtube.com/watch?v=buptHUzDKcE&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=5

At 53:45 the professor starts to describe temporal difference learning for linear value function approximation. At 56:20 one can see on the slide how the weights are updated. Is the equation for $ \Delta w $ correct?

In my opinion the term in brackets should be multiplied by $ X(s) - \gamma X(s') $ instead of $ X(s) $, because $ \frac {\partial ( \gamma X(s')^T w )} {\partial w} $ is not zero. Am I right?


Solution

It's the notation that might be a bit confusing. Take a look at David Silver's slides, pages 10-15; he has a complete derivation. Do not forget that the term $r + \gamma V(s';w)$ is the target. She mentions this in the video, along with the fact that you are effectively doing supervised learning with a target provided by a bootstrapped value (you try to minimize the Bellman error).

In other words: you are trying to minimize the error between a target value and the estimate from your model. You do not know the target value, so you estimate it by bootstrapping. Then you have $SE=(y - \hat{y}(w))^2$. The target $y$ is considered known (as in supervised learning), so eventually you are searching for the weights $w$ that make the output $\hat{y}$ of your model close to $y$. What I wrote here applies to batch training (minimizing the mean squared error, MSE).
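To make that explicit, here is a sketch of the gradient step with linear features, writing $\hat{y}(w) = X(s)^\top w$ and holding the bootstrapped target fixed (my own notation, chosen to match the slide as closely as I can):

```latex
% Semi-gradient TD(0) update with linear features X(s), i.e. \hat{y}(w) = X(s)^\top w.
% The bootstrapped target y = r + \gamma X(s')^\top w is held fixed when differentiating.
\begin{align*}
  \nabla_w \tfrac{1}{2}\bigl(y - X(s)^\top w\bigr)^2
    &= -\bigl(y - X(s)^\top w\bigr)\, X(s) \\
  \Delta w &= \alpha \bigl(r + \gamma\, X(s')^\top w - X(s)^\top w\bigr)\, X(s)
\end{align*}
% If the target were differentiated as well (as the question suggests), X(s) would be
% replaced by X(s) - \gamma X(s'); that is the "residual gradient" variant, not TD(0).
```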

So, eventually, yes: the quantity $V(s';w)$ is treated as constant and its derivative with respect to $w$ is 0.
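For comparison, here is a minimal, hypothetical Python sketch contrasting the semi-gradient TD(0) update from the lecture with the full-gradient ("residual gradient") variant the question proposes; the feature vectors, reward, and step size are made-up placeholders.

```python
import numpy as np

def semi_gradient_td0_update(w, x_s, x_s_next, r, gamma=0.99, alpha=0.1):
    """TD(0) with linear function approximation: V(s; w) = x(s) @ w.
    The bootstrapped target r + gamma * V(s'; w) is treated as a constant,
    so the gradient of the prediction is just x(s)."""
    td_error = r + gamma * x_s_next @ w - x_s @ w
    return w + alpha * td_error * x_s

def residual_gradient_update(w, x_s, x_s_next, r, gamma=0.99, alpha=0.1):
    """Full gradient of the squared TD error (the variant the question asks about):
    differentiating the target as well replaces x(s) by x(s) - gamma * x(s')."""
    td_error = r + gamma * x_s_next @ w - x_s @ w
    return w + alpha * td_error * (x_s - gamma * x_s_next)

# Toy example with made-up feature vectors and reward.
w = np.zeros(3)
x_s, x_s_next, r = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0]), 1.0
print("semi-gradient TD(0):", semi_gradient_td0_update(w, x_s, x_s_next, r))
print("residual gradient:  ", residual_gradient_update(w, x_s, x_s_next, r))
```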

Licensed under: CC-BY-SA with attribution