In calculating policy gradients, wouldn't longer trajectories have more weight according to the policy gradient formula?

datascience.stackexchange https://datascience.stackexchange.com/questions/47577

Question

In Sergey Levine's lecture on policy gradients (Berkeley Deep RL course), he shows that the policy gradient can be evaluated according to the formula

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$$

In this formula, wouldn't longer trajectories get more weight (in finite-horizon settings), since the sum over $\nabla_\theta \log \pi_\theta$ would involve more terms? Why would it work like that?
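Spelling this out (a restatement, assuming a finite horizon $T_i$ that can differ per sampled trajectory), the term contributed by a single trajectory $\tau_i$ is

$$g_i = \underbrace{\left( \sum_{t=1}^{T_i} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right)}_{T_i \text{ summands}} \cdot \underbrace{\left( \sum_{t=1}^{T_i} r(s_{i,t}, a_{i,t}) \right)}_{T_i \text{ summands}}$$

With, say, a constant nonzero per-step reward, both factors grow with $T_i$, so the trajectory's contribution scales roughly like $T_i^2$.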

The specific example I have in mind is Pac-Man, where longer trajectories would contribute more to the gradient. Should it work like that?
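For concreteness, here is a minimal sketch of the single-trajectory term above (not from the lecture; the two-action softmax policy, the random action sequences, and the constant +1 per-step reward are made up for illustration), comparing a 3-step and a 30-step trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy softmax policy over 2 actions, ignoring state for simplicity:
# pi_theta(a) = softmax(theta)[a]. The theta values are arbitrary.
theta = np.array([0.3, -0.1])

def grad_log_pi(a, theta):
    """Gradient of log softmax(theta)[a] with respect to theta."""
    probs = np.exp(theta) / np.exp(theta).sum()
    return np.eye(len(theta))[a] - probs

def trajectory_term(actions, rewards, theta):
    """Single-trajectory REINFORCE term:
    (sum_t grad log pi(a_t)) * (sum_t r_t)."""
    g = sum(grad_log_pi(a, theta) for a in actions)
    return g * sum(rewards)

# A short (3-step) and a long (30-step) trajectory with the same
# hypothetical +1 reward per step, as in a survive-longer game.
short_actions = rng.integers(0, 2, size=3)
long_actions = rng.integers(0, 2, size=30)

g_short = trajectory_term(short_actions, [1.0] * 3, theta)
g_long = trajectory_term(long_actions, [1.0] * 30, theta)

print("||g_short|| =", np.linalg.norm(g_short))
print("||g_long||  =", np.linalg.norm(g_long))
# The long trajectory contributes both more grad-log-pi summands and a
# larger return, so its term typically dominates the averaged gradient.
```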

No correct solution
