In calculating policy gradients, wouldn't longer trajectories have more weight according to the policy gradient formula?
01-11-2019
Question
In Sergey Levine's lecture on policy gradients (Berkeley Deep RL course), he shows that the policy gradient can be estimated from sampled trajectories as

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$$
In this formula, wouldn't longer trajectories get more weight (in finite-horizon settings), since the middle factor, the sum over $\nabla_\theta \log \pi_\theta$, involves more terms? Why should it work like that?
The specific example I have in mind is Pac-Man: longer trajectories would contribute more to the gradient. Should it work that way?
No correct solution
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange