Question

What I understood about value iteration while coding it is that we need to keep a policy fixed, and the value function of each state is then calculated according to that policy. Is that right?

But in policy iteration, the policy changes from iteration to iteration. Am I right?


Solution

In policy iteration, you define a starting policy and iterate towards the best one by estimating the state values of the current policy and then changing the action choices to be greedy with respect to those values. The policy is explicitly stored and updated on each major step. After each change to the policy, you re-calculate the value function for that policy to within a certain precision, so the value functions you work with always measure an actual policy. If you halted the iteration just after the evaluation step, you would have a (possibly non-optimal) policy together with the value function for that policy.
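
To make that concrete, here is a minimal sketch of tabular policy iteration. The MDP representation is my own assumption (not from the question): `P[s][a]` is a list of `(probability, next_state, reward)` tuples, and `gamma` and `theta` are illustrative defaults.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    # Assumed MDP format: P[s][a] = list of (probability, next_state, reward) tuples
    policy = np.zeros(n_states, dtype=int)   # arbitrary starting policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: compute V for the *current* policy to precision theta
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: make each state's action greedy with respect to V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:                            # no action changed: policy is optimal
            return policy, V
```

Notice that `policy` exists from the very first line and `V` is always re-evaluated against it, which is exactly the property described above.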

In value iteration, you implicitly solve for the state values under an optimal policy. There is no need to define an actual policy during the iterations; you can derive one at the end from the values you calculate. You could, if you wished, use the state values after any iteration to read off the "current" greedy policy they imply, but those intermediate values generally do not equal the value function of that implied policy, although towards the end of the process they will be close.
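
For comparison, here is a matching sketch of value iteration under the same assumed MDP representation. No policy variable appears inside the loop; a greedy policy is only extracted once, after the values have converged.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    # Assumed MDP format: P[s][a] = list of (probability, next_state, reward) tuples
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: max over actions, no explicit policy
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best_value = max(q)
            delta = max(delta, abs(best_value - V[s]))
            V[s] = best_value
        if delta < theta:
            break
    # Derive a greedy policy from the (near-)optimal values only at the end
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return policy, V
```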
