I found this article quite readily: Value iteration and policy iteration algorithms for Markov decision problem [PDF file]. It explains in quite a bit more detail what's going on.
Conceptually, you have a system that can be in a number of states, rewards for transitions from one state to another, and actions that can sometimes cause state transitions. The basic idea is to keep iterating until you arrive at a utility vector that no longer changes. That's what the final test `max_{s_i} |U(s_i) - U'(s_i)| < e` checks. (Here, e is short for epsilon, a small convergence threshold, and it probably should be an additional input to the algorithm.)
In each iteration, you want to take the best action for each state. The best action is the one that yields the greatest expected utility, i.e. the utility of each possible next state weighted by the probability of reaching it. That's what `max_a Σ_j P^a_{ij} * U(s_j)` computes: for each candidate action a, sum the utilities U(s_j) of the possible next states, each weighted by the transition probability P^a_{ij} of going from state s_i to s_j under that action, and pick the action with the largest sum.
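Putting the two pieces together, here is a minimal sketch of value iteration on a made-up toy MDP. It assumes the standard discounted update `U(s_i) = R(s_i) + gamma * max_a Σ_j P^a_{ij} * U(s_j)`; the reward vector R, the discount factor gamma, and the particular transition probabilities are illustration values of mine, not something taken from the article:

```python
def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Iterate U(s_i) = R(s_i) + gamma * max_a sum_j P^a_ij * U(s_j)
    until max_{s_i} |U(s_i) - U'(s_i)| < eps.

    P[a][i][j] is the probability of moving from state i to state j
    when taking action a; R[i] is the reward for being in state i.
    """
    n = len(R)
    U = [0.0] * n
    while True:
        # For each state, take the best action: the one whose
        # probability-weighted sum of next-state utilities is greatest.
        U_new = [
            R[i] + gamma * max(
                sum(P[a][i][j] * U[j] for j in range(n))
                for a in range(len(P))
            )
            for i in range(n)
        ]
        # Convergence test: stop once no state's utility changed by eps or more.
        if max(abs(U_new[i] - U[i]) for i in range(n)) < eps:
            return U_new
        U = U_new

# Toy example: 2 states, 2 actions (made-up numbers).
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # action 0: rows = from-state, cols = to-state
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
]
R = [0.0, 1.0]                  # reward for being in each state
U = value_iteration(P, R)
```

With these numbers, state 1 can be held forever via action 1, so its utility converges to R(1)/(1 - gamma) = 10, and state 0's utility settles at whatever its best action makes reachable from there.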