Markov decision process - how to use optimal policy formula?

https://stackoverflow.com/questions/9055397

04-12-2019
|

Question

I have a task, where I have to calculate optimal policy (Reinforcement Learning - Markov decision process) in the grid world (agent movies left,right,up,down).

In left table, there are Optimal values (V*). In right table, there is sollution (directions) which I don't know how to get by using that "Optimal policy" formula. Y=0.9 (discount factor)

enter image description here

And here is formula:

enter image description here

So if anyone knows how to use that formula, to get solution (those arrows), please help.

Edit: there is whole problem description on this page: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node35.html Rewards: state A(column 2, row 1) is followed by a reward of +10 and transition to state A', while state B(column 4, row 1) is followed by a reward of +5 and transition to state B'. You can move: up,down,left,right. You cannot move outside the grid or stay in same place.

Solution

Break the math down, piece by piece:

The arg max (....) is telling you to find the argument, a, which maximizes everything in the parentheses. The variables a, s, and s' are an action, a state you're in, and a state that results from that action, respectively. So the arg max (...) is telling you to find an action that maximizes that term.

You know gamma, and someone did the hard work of calculating V*(s'), which is the value of that resulting state. So you know what to plug in there, right?

So what is p(s,a,s')? That is the probability that, starting from s and doing a, you end in some s'. This is meant to represent some kind of faulty actuator-- you say "go forward!" and it foolishly decides to go left (or two squares forward, or remain still, or whatever.) For this problem, I'd expect it to be given to you, but you haven't shared it with us. And the summation over s' is telling you that when you start in s, and you pick an action a, you need to sum over all possible resulting s' states. Again, you need the details of that p(s,a,s') function to know what those are.

And last, there is r(s,a) which is the reward for doing action a in state s, regardless of where you end up. In this problem it might be slightly negative, to represent a fuel cost. If there is a series of rewards in the grid and a grab action, it might be positive. You need that, too.

So, what to do? Pick a state s, and calculate your policy for it. For each s, you're going have the possibility of (s,a1), and (s,a2), and (s,a3), etc. You have to find the a that gives you the biggest result. And of course for each pair (s,a) you may (in fact, almost certainly will) have multiple values of s' to stick in the summation.

If this sounds like a lot of work, well, that's why we have computers.

PS - Read the problem description very carefully to find out what happens if you run into a wall.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow