Question

Suppose you're given an MDP where rewards are attributed for reaching a state, independently of the action. Then when doing value iteration:

$$ V_{i+1}(s) = \max_a \sum_{s'} P_a(s,s') \bigl(R_a(s,s') + \gamma V_i(s')\bigr)$$

what is $R_a(s,s')$?
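
To make the shapes concrete, here is a minimal sketch of the update I have in mind (NumPy, with made-up sizes and random transition/reward arrays; the names `P`, `R`, `V` are just placeholders):

```python
import numpy as np

# Made-up arrays, purely to illustrate the shapes involved:
# P[a, s, s'] = P_a(s, s'), R[a, s, s'] = R_a(s, s'), V[s] = V_i(s).
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # each P[a, s, :] sums to 1
R = rng.normal(size=(n_actions, n_states, n_states))
V = np.zeros(n_states)

def sweep(V, P, R, gamma):
    # Q[a, s] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V_i(s'))
    Q = np.einsum('asn,asn->as', P, R + gamma * V[None, None, :])
    # V_{i+1}(s) = max_a Q[a, s]
    return Q.max(axis=0)

V_next = sweep(V, P, R, gamma)
```

In my state-reward setting, the question is just which values to put into `R[a, s, s']`: the reward of the successor state $R(s')$ or of the current state $R(s)$.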

The problem I'm having is that terminal states have, by default, $V(s_T) = R(s_T)$ (some terminal reward). When I try to implement value iteration and set $R_a(s,s') = R(s')$ (which is what I thought it should be), I find that states neighboring a terminal state end up with a higher value than the terminal state itself, since

$$ P_a(s,s_T) ( R_a(s,s_T) + \gamma V_i(s_T) ) $$

can easily be greater than $V_i(s_T)$ (with a deterministic transition it equals $R(s_T) + \gamma V_i(s_T) = (1+\gamma)R(s_T)$, which exceeds $R(s_T)$ for any positive terminal reward), and in practice that makes no sense. So the only conclusion I can draw is that, in my case, $R_a(s,s') = R(s)$. Is this correct?
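
To make the symptom concrete, here is a tiny numerical sketch comparing the two conventions (the numbers $R(s_T)=10$, $\gamma=0.9$ and the deterministic transition are made up):

```python
# Made-up example: terminal reward R(s_T) = 10, a neighboring state s with a
# single action that reaches s_T with probability 1, and gamma = 0.9.
gamma = 0.9
R_T = 10.0
V_T = R_T                  # V(s_T) = R(s_T), as in the post
p = 1.0                    # P_a(s, s_T)

# Convention R_a(s, s') = R(s'):
backup_reward_on_arrival = p * (R_T + gamma * V_T)   # 19.0 > V(s_T)

# Convention R_a(s, s') = R(s), with the neighbor's own reward taken as 0:
R_s = 0.0
backup_reward_on_state = p * (R_s + gamma * V_T)     # 9.0 < V(s_T)

print(backup_reward_on_arrival, backup_reward_on_state)
```

Under the first convention the neighbor's backed-up value exceeds the terminal value, which is exactly the behavior I'm seeing.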

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange