Question

When reading pytorch tutorial:

Our aim will be to train a policy that tries to maximize the discounted, cumulative reward Rt0=∑∞t=t0γt−t0rt, where Rt0 is also known as the return

I know γ is the discount factor, but I am not sure that what t-t0 ofγt−t0 mean?

Thank you.

Was it helpful?

Solution

I have no experience with reinformcement learning, however looking at the figure I think I understand what is meant. Gamma is the discount factor, which is taken to the power t-t0, i.e. the number of episodes starting from t. This gives the discount factor for a specific episode, which is then multiplied by the return of that episode, r_t, to get the discounted reward for one specific episode. The total return is then computed by summing all the future rewards for future episodes.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top