Question

Regarding value iteration in dynamic programming (reinforcement learning) for a grid world, the value update of each state is given by:
$$V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V(s')\bigr]$$


Now suppose I am in box (3,2). I can go to (4,2) (up), (3,3) (right), and (3,1) (left), and none of these is my final state, so I get a reward of -0.1 for entering each of them. The current value of every state is 0. The probability of going north is 0.8, and of going left or right is 0.1 each. So, since going left or right gives me more reward (as reward times probability will be negative), I go left or right. Is that the mechanism? Am I correct? But the formula contains a summation term, so I basically cannot understand it. Can anyone explain it with an example?


Solution

The probabilities you describe refer only to the go-north action: if you want to go north, you have an 80% chance of actually going north and a 10% chance each of slipping left or right, which makes the problem non-deterministic and therefore harder. The same rule applies to every direction. Also, the formula does not tell you which action to choose, only how to update the values. To select an action under a greedy policy, you would pick the one with the highest expected value, i.e. the largest sum of $P(s' \mid s, a)\,[R + \gamma V(s')]$ over its possible outcomes.

The formula says to sum the values for all possible outcomes from the best action. So, supposing go-north is indeed the best action, you have:

$$.8 * (-.1 + 0) + .1 * (-.1 + 0) + .1 * (-.1 + 0) = -.1$$
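
In code, this per-action sum and the maximum over actions look roughly like the following Python sketch. The function names and the `SLIPS` table are mine (not from the original formula), and `reward`/`value` are assumed to be dictionaries keyed by direction:

```python
# Minimal sketch of the one-step backup for a single state under the
# 0.8 / 0.1 / 0.1 slip model: the intended move succeeds with p = 0.8
# and slips to each perpendicular direction with p = 0.1.

# Perpendicular ("slip") directions for each intended action.
SLIPS = {
    "north": ("west", "east"),
    "south": ("west", "east"),
    "east": ("north", "south"),
    "west": ("north", "south"),
}

def action_value(intended, reward, value, gamma=1.0):
    """Sum over possible outcomes of P(outcome) * (R + gamma * V(outcome))."""
    left, right = SLIPS[intended]
    outcomes = [(0.8, intended), (0.1, left), (0.1, right)]
    return sum(p * (reward[d] + gamma * value[d]) for p, d in outcomes)

def greedy_backup(reward, value, gamma=1.0):
    """Value-iteration update for one state: the max over all actions."""
    q = {a: action_value(a, reward, value, gamma) for a in SLIPS}
    best = max(q, key=q.get)
    return best, q[best]
```

With every reward equal to -0.1 and every neighbouring value 0, all four action values come out to -0.1, which is exactly the go-north sum above.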

But suppose you still don't know which action is best and want to select one greedily. Then you must compute the sum for each possible action (north, south, east, west). Your example, with all values set to 0 and the same reward everywhere, is not very interesting, so let's say the move east yields a +1 reward (-0.1 for the remaining directions) and that the southern neighbour already has $V(s) = 0.5$ (0 for the remaining states). Then you compute the value of each action (take $\gamma = 1$, since it is a user-chosen parameter):

  • North: $.8 * (-.1 + 0) + .1 * (-.1 + 0) + .1 * (1 + 0) = -.08 - .01 + .1 = .01$
  • South: $.8 * (-.1 + .5) + .1 * (-.1 + 0) + .1 * (1 + 0) = 0.32 - .01 + .1 = .41$
  • East: $.8 * (1 + 0) + .1 * (-.1 + 0) + .1 * (-.1 + .5) = .8 - .01 + .04 = .83$
  • West: $.8 * (-.1 + 0) + .1 * (-.1 + 0) + .1 * (-.1 + .5) = -.08 - .01 + .04 = -.05$

So you would update your policy to go East from the current state, and update the current state value to 0.83.
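
As a sanity check, here is a short self-contained snippet (the rewards and values are just the ones assumed in the example above) that reproduces the four sums and the resulting greedy choice:

```python
# Worked example: +1 reward for ending up east, -0.1 otherwise,
# V(south neighbour) = 0.5, all other neighbouring values 0, gamma = 1.
SLIPS = {"north": ("west", "east"), "south": ("west", "east"),
         "east": ("north", "south"), "west": ("north", "south")}
reward = {"north": -0.1, "south": -0.1, "east": 1.0, "west": -0.1}
value = {"north": 0.0, "south": 0.5, "east": 0.0, "west": 0.0}

q = {}
for action, (slip1, slip2) in SLIPS.items():
    q[action] = (0.8 * (reward[action] + value[action])
                 + 0.1 * (reward[slip1] + value[slip1])
                 + 0.1 * (reward[slip2] + value[slip2]))

for action, v in q.items():
    print(f"{action}: {v:.2f}")   # north: 0.01, south: 0.41, east: 0.83, west: -0.05

best = max(q, key=q.get)
print("greedy action:", best)               # east
print("updated V(s):", round(q[best], 2))   # 0.83
```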

Licensed under: CC-BY-SA with attribution