Question

In any of the standard reinforcement learning algorithms that use temporal-difference learning with eligibility traces (e.g. SARSA, Q-learning), the question arises of what values to use for the lambda and gamma hyper-parameters for a specific task.

I understand that lambda is tied to the length of the eligibility traces and gamma can be interpreted as how much to discount future rewards, but how do I know when my lambda value is too low for a given task, or my gamma too high?

I realize these questions don't have well defined answers, but knowing some 'red flags' for having inappropriate values would be very useful.

Take the standard cart-pole (inverted pendulum) task, for example. Should I set gamma high, since it takes many steps to fail the task, or low, because the state information is completely Markovian? And I can't even fathom a rationale for lambda values...

Solution

Gamma

Typically, gamma is viewed as part of the problem, not of the algorithm. A reinforcement learning algorithm tries, for each state, to optimise the cumulative discounted reward:

r_1 + gamma*r_2 + gamma^2*r_3 + gamma^3*r_4 + ...

where r_n is the reward received n time steps after the current state. So, for one choice of gamma the algorithm may optimise one thing, and for another choice it will optimise something else.
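
For concreteness, here is a minimal sketch of this return in Python (the answer itself contains no code, so the language and the illustrative reward sequence are assumptions). It shows how gamma controls how far into the future rewards still matter:

def discounted_return(rewards, gamma):
    """Sum r_1 + gamma*r_2 + gamma^2*r_3 + ... for a list of per-step rewards."""
    return sum((gamma ** n) * r for n, r in enumerate(rewards))

# +1 reward for every step the pole stays balanced, as in the cart-pole example below
rewards = [1.0] * 100
print(discounted_return(rewards, 0.5))   # ~2.0: only the next couple of steps matter
print(discounted_return(rewards, 0.95))  # ~19.9: rewards some 20 steps ahead still count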

However, once you have defined a certain high-level goal, there often still remains a modelling choice, as many different gammas might satisfy the requirements of that goal. For instance, in the cart-pole the goal is to balance the pole indefinitely. If you give a reward of +1 for every step that the pole is balanced, the same policies (the ones that always balance the pole) are optimal for all gamma > 0. However, the ranking of suboptimal policies, which determines the learning behaviour on the way to this goal, will differ for different values of gamma.

In general, most algorithms learn faster when they don't have to look too far into the future. So it sometimes helps performance to set gamma relatively low. A general rule of thumb might be: determine the lowest gamma, min_gamma, that still satisfies your high-level goal, and then set gamma = (min_gamma + 1)/2. (You don't want to use gamma = min_gamma itself, since then some suboptimal goal will be deemed virtually as good as the desired goal.) Another useful rule of thumb: for many problems a gamma of 0.9 or 0.95 is fine. However, always think about what such a gamma means for the goal you are optimising, in combination with your reward function.

Lambda

The lambda parameter determines how much you bootstrap on previously learned value estimates versus using the current Monte Carlo roll-out. This implies a trade-off between more bias (low lambda) and more variance (high lambda). In many cases, setting lambda equal to zero already gives a fine algorithm, but setting lambda somewhat higher helps speed things up. Here, you do not have to worry about what you are optimising: the goal is unrelated to lambda, and this parameter only helps to speed up learning. In other words, lambda is completely part of the algorithm and not of the problem.
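
To make this concrete, below is a rough sketch (not from the original answer) of one episode of tabular SARSA(lambda) with accumulating eligibility traces, showing exactly where lambda and gamma enter the update. The toy environment, table sizes, and hyper-parameter values are placeholders chosen for illustration:

import numpy as np

n_states, n_actions = 100, 2             # assumed sizes for a small discretised task
alpha, gamma, lam, epsilon = 0.1, 0.9, 0.9, 0.1

class DummyEnv:
    """Toy stand-in for a real environment (e.g. a discretised cart-pole)."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        next_state = np.random.randint(n_states)
        reward = 1.0                      # +1 per step, as in the cart-pole example
        done = self.t >= 200
        return next_state, reward, done

Q = np.zeros((n_states, n_actions))      # action-value estimates
E = np.zeros_like(Q)                     # eligibility traces

def epsilon_greedy(state):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

env = DummyEnv()
state = env.reset()
action = epsilon_greedy(state)
done = False
while not done:
    next_state, reward, done = env.step(action)
    next_action = epsilon_greedy(next_state)

    # TD error: bootstrap on the learned value of the next state-action pair
    target = reward + (0.0 if done else gamma * Q[next_state, next_action])
    delta = target - Q[state, action]

    # Accumulating trace for the visited pair; every trace then decays by
    # gamma * lambda, so lambda controls how far back credit is propagated
    E[state, action] += 1.0
    Q += alpha * delta * E
    E *= gamma * lam

    state, action = next_state, next_action

With lam = 0 the traces vanish immediately and the loop reduces to one-step SARSA; with lam = 1 credit flows all the way back along the episode, which is the Monte Carlo end of the trade-off.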

A general rule of thumb is to use a lambda equal to 0.9. However, it might be good just to try a few settings (e.g., 0, 0.5, 0.8, 0.9, 0.95 and 1.0) and plot the learning curves. Then, you can pick whichever seems to be learning the fastest.
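
A sketch of that sweep, assuming a hypothetical train_sarsa_lambda(env, lam, n_episodes) helper that wraps a loop like the one above and returns the per-episode returns, could look like this:

import matplotlib.pyplot as plt

for lam in [0.0, 0.5, 0.8, 0.9, 0.95, 1.0]:
    returns = train_sarsa_lambda(env, lam, n_episodes=500)   # hypothetical helper
    plt.plot(returns, label="lambda = %.2f" % lam)

plt.xlabel("episode")
plt.ylabel("return per episode")
plt.legend()
plt.show()

Averaging each curve over several random seeds makes the comparison less noisy before picking the fastest-learning lambda.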

Licensed under: CC-BY-SA with attribution