In calculating policy gradients, wouldn't longer trajectories have more weight according to the policy gradient formula?
01-11-2019
Question
In Sergey Levine's lecture on policy gradients (Berkeley Deep RL course), he shows that the policy gradient can be estimated from sampled trajectories as

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$$
In this formula, wouldn't longer trajectories get more weight (in finite-horizon settings), since the middle factor, the sum over $\nabla_\theta \log \pi_\theta$, involves more terms? Why should it work like that?
The specific example I have in mind is Pac-Man: longer trajectories would contribute more to the gradient. Should it work that way?
No correct solution
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange