Pregunta

I am trying to learn about Markov decision problems and I was given the algorithm for Value Iteration, but I am confused how to turn them into actual C++ code. Mainly the parts where summations and such occur. Here is the algorithm:

function VALUE-ITERATION(P;R) returns a utility matrix
    inputs: P, a transition-probability matrix
            R, a reward matrix
    local variables: U, utility matrix, initially identical to R
                     U', utility matrix, initially identical toR
    repeat
         U <- U'
         for each state i do
             U'(s_i) <-  R(s_i) + max_a Summation_j P^a_ij*U(s_j)
         end
    until max_(s_i) |U(s_i) - U'(s_i)| < e
return U

This looks like hieroglyphics to me, is there a simpler algorithm that would be of more help to me? Or could somebody dumb it down for me?

¿Fue útil?

Solución

I found this article quite readily: Value iteration and policy iteration algorithms for Markov decision problem [PDF file]. It explains quite a bit more what's going on.

Conceptually, you have a system that can be in a number of states, rewards for transitions from one state to another, and actions that sometimes can result in state transitions. The basic idea is to keep iterating until you have arrived at a utility matrix that doesn't change That's what that final test max_(s_i) | U(s_i) - U'(s_i)| < e looks for. (Here, e is short for epsilon, a small number, and probably should be an additional input.)

For each iteration, you want to take the best action for each state. The best action is the one that yields the greatest reward, weighted by probability. That's what max_a Summation_j P^a_ij*U(s_j) does: Find the action that yields the best reward, weighted by probability.

Otros consejos

I can translate bits and pieces, but there's a lot of information in your code that only makes sense in context, and there's no way for us to know that context.
Also, it appears that some formatting was lost along the way, since P^a_ij looks like it was at one point P to the power of a_i times j. David seems to know how to interpret the crazy bit.
Also the condition loop uses | in the pseudo-code which is wierd, but I took it literally.

utility_matrix VALUE_ITERATION(const probability_matrix& P,
                               const reward_matrix& R)
{
    utility_matrix U(R);
    utility_matrix UP(R);
    do {
        U = UP;
        for(int s_i : ????) //for each state in what?
            UP[s_i] = R[s_i] + ???? //max_a Summation_j P^a_ij*U(s_j)
    while(max(s_i) ???? std::abs(U[s_i] - UP[s_i])<e);
    return U;
}

As akira said, the understandable bits are straightforward, if you couldn't do those, you might need to learn more about C++ before you tackle this.

As per your comment, I found C code that looks vaguely like your algorithm here. (Lines 62-76)

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top