Question

Discounted rewards seem unbalanced to me.

Take as an example an episode with 4 actions, where each action receives a reward of +1:

+1 -> +1 -> +1 -> +1

The discounted reward for the last action is: 1
The discounted reward for the first action (taking gamma = 1 for simplicity) is: 4

Intuitively, both actions are equally good, because both received the same reward.
But their total rewards are different, which seems unbalanced.
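
To make the numbers concrete, here is a minimal Python sketch of how I understand the per-step totals (usually called returns) to be computed. The function name `discounted_returns` and the second run with gamma = 0.5 are my own illustration, not taken from any particular library:

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0, 1.0]  # the 4-action episode, +1 per action
print(discounted_returns(rewards, gamma=1.0))  # [4.0, 3.0, 2.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.875, 1.75, 1.5, 1.0]
```

With gamma = 1 the first action gets 4 and the last gets 1, which is the imbalance I mean. A smaller gamma shrinks the gap but does not remove it.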


So when we backpropagate, will the first action be favored over the last action?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange