Question

I am using the rlglue-based python-rl framework for Q-learning. My understanding is that, over a number of episodes, the algorithm converges to an optimal policy (a mapping that says what action to take in each state).

Question 1: Does this mean that after a number of episodes (say 1000 or more) I should essentially get the same state:action mapping?

When I plot the rewards (or the rewards averaged over 100 episodes), I get a graph similar to Fig. 6.13 in this link.

Question 2: If the algorithm has converged to some policy, why do the rewards drop? Is it possible for the rewards to vary drastically?

Question 3: Is there some standard method I can use to compare the results of various RL algorithms?


Solution

Q1: It will converge to a single mapping, unless more than one mapping is optimal.
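One way to check this in a tabular setting is to extract the greedy policy implied by the Q-table every so often and see whether it stops changing. A minimal sketch, assuming a NumPy Q-table of shape (num_states, num_actions); the names `q_table` and `previous_policy` are illustrative, not part of rlglue or python-rl:

```python
import numpy as np

def greedy_policy(q_table):
    """Return the greedy state -> action mapping implied by a Q-table.

    q_table is assumed to be a (num_states, num_actions) array.
    """
    return np.argmax(q_table, axis=1)

# Hypothetical usage: snapshot the policy every N episodes and
# check whether it has stopped changing.
# policy_now = greedy_policy(q_table)
# converged = np.array_equal(policy_now, previous_policy)
```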

Q2: Q-Learning has an exploration parameter that determines how often it takes random, potentially sub-optimal moves. Rewards will fluctuate as long as this parameter is non-zero.
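As a rough sketch (not the python-rl API), epsilon-greedy action selection looks like the following; with a fixed non-zero `epsilon` the agent keeps taking random actions, which is what produces the dips in the reward curve:

```python
import random
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    num_actions = q_table.shape[1]
    if random.random() < epsilon:
        return random.randrange(num_actions)  # exploratory, possibly sub-optimal
    return int(np.argmax(q_table[state]))     # greedy w.r.t. the current Q estimates

# Decaying epsilon over episodes, e.g. epsilon = max(0.01, epsilon * 0.995),
# makes the reward curve settle down; setting epsilon = 0 evaluates the
# learned policy without exploration noise.
```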

Q3: Reward graphs, as in the link you provided. Check http://rl-community.org.
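A common way to make such comparisons fair is to run each algorithm several times independently, average the per-episode reward over those runs, smooth it, and plot the learning curves together. A minimal sketch under those assumptions; the function and variable names are illustrative, not from rl-glue:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curves(rewards_by_algorithm, window=100):
    """rewards_by_algorithm maps an algorithm name to a (runs, episodes)
    array of per-episode returns; plot the mean return smoothed over
    `window` episodes for each algorithm."""
    for name, rewards in rewards_by_algorithm.items():
        mean_per_episode = rewards.mean(axis=0)  # average over independent runs
        kernel = np.ones(window) / window
        smoothed = np.convolve(mean_per_episode, kernel, mode="valid")
        plt.plot(smoothed, label=name)
    plt.xlabel("Episode")
    plt.ylabel("Average reward (smoothed)")
    plt.legend()
    plt.show()
```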

Licensed under: CC-BY-SA with attribution