Question

Why is the termination condition of the value-iteration algorithm (for example, http://aima-java.googlecode.com/svn/trunk/aima-core/src/main/java/aima/core/probability/mdp/search/ValueIteration.java )

in an MDP (Markov Decision Process)

||U_{i+1} - U_i|| < error * (1 - gamma) / gamma, where

U_i is the vector of utilities at iteration i

U_{i+1} is the updated vector of utilities

error is the error bound used in the algorithm

gamma is the discount factor used in the algorithm

Where does the factor error * (1 - gamma) / gamma come from? Is the division by gamma there because every step is discounted by gamma? But why multiply by (1 - gamma)? And how should the error bound be chosen?


Solution

The quantity ||U_{i+1} - U_i|| is called the Bellman error or Bellman residual.

The bound follows from the fact that the Bellman update is a contraction by the factor gamma in the max norm. From this one can show ||U_{i+1} - U*|| <= gamma / (1 - gamma) * ||U_{i+1} - U_i||, where U* is the true utility vector. So once the residual ||U_{i+1} - U_i|| drops below error * (1 - gamma) / gamma, the returned utilities are guaranteed to be within error of U*. The error bound itself is simply whatever accuracy you require of the final utilities.

See Williams and Baird, 1993 for use in MDPs.

See Littman, 1994 for use in POMDPs.
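As a minimal sketch of how this termination test is used, here is value iteration on a hypothetical two-state, two-action MDP (the transition probabilities, rewards, and class name are made up for illustration; only the stopping condition mirrors the one asked about):

```java
import java.util.Arrays;

public class ValueIterationSketch {
    static final double GAMMA = 0.9;     // discount factor
    static final double EPSILON = 1e-4;  // desired max error of the returned utilities

    // Hypothetical tiny MDP: P[a][s][s2] = transition probability, R[s] = reward.
    static final double[][][] P = {
        { {0.8, 0.2}, {0.1, 0.9} },   // action 0
        { {0.5, 0.5}, {0.3, 0.7} }    // action 1
    };
    static final double[] R = {1.0, -1.0};

    public static double[] valueIteration() {
        double[] U = new double[R.length];
        while (true) {
            double[] Unew = new double[U.length];
            double delta = 0.0;  // max-norm Bellman residual ||U_{i+1} - U_i||
            for (int s = 0; s < U.length; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (double[][] Pa : P) {              // maximize over actions
                    double q = 0.0;
                    for (int s2 = 0; s2 < U.length; s2++) {
                        q += Pa[s][s2] * U[s2];        // expected utility of successor
                    }
                    best = Math.max(best, q);
                }
                Unew[s] = R[s] + GAMMA * best;         // Bellman update
                delta = Math.max(delta, Math.abs(Unew[s] - U[s]));
            }
            U = Unew;
            // Termination: once the residual is below epsilon*(1-gamma)/gamma,
            // the contraction property guarantees ||U - U*|| < epsilon.
            if (delta < EPSILON * (1 - GAMMA) / GAMMA) {
                return U;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(valueIteration()));
    }
}
```

Note that the tighter the error bound and the closer gamma is to 1, the smaller the threshold becomes and the more iterations the loop needs before it stops.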

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow