Question

The Sutton & Barto book on reinforcement learning mentions the 10 armed test bed in chapter 2, Bandit Problems:

To roughly assess the relative effectiveness of the greedy and ε-greedy methods, we compared them numerically on a suite of test problems. This was a set of 2000 randomly generated n-armed bandit tasks with n = 10. For each bandit, the action values, $q_∗(a), a = 1, . . . , 10,$ were selected according to a normal (Gaussian) distribution with mean 0 and variance 1. On the $t$th time step with a given bandit, the actual reward $R_t$ was the $q_∗(A_t)$ for the bandit (where $A_t$ was the action selected) plus a normally distributed noise term that was mean 0 and variance 1 [. . . .] We call this suite of test tasks the 10-armed testbed.

What is the reward function in the 10-armed testbed? I interpreted it as something like $q_*(a)$ plus some normally distributed random value, where $q_*(a)$ is the true value of action $a$.

Why is the reward function chosen this way in the testbed, and how does the reward function affect the value estimates and the graphs?

I guess the reason I'm asking this question is because I'm not completely clear on what a reward looks like in the real world, where I don't know anything about $q_*(a)$.


Solution

The reward function in the Chapter 2 testbed is simply the "true" mean value for the chosen action, plus a "noise term" that is normally distributed with mean 0 and standard deviation 1.

The noise has the same distribution as the initial setting of the "true" values. The difference is that you set the true values once at the start and do not change them, then add fresh noise each time a reward is evaluated. The goal for the learner is then to find the action with the best "true" value whilst only observing the noisy rewards.

This matches your understanding as I read it from the question. You could write it like this:

Initialisation:

  • $\forall a \in A: q_*(a) \leftarrow N(0,1)$

Evaluation:

  • $R_t = r(A_t) = q_*(A_t) + N(0,1)$

where $N(\mu,\sigma)$ is a sample from the normal distribution with mean $\mu$ and standard deviation $\sigma$.
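As a minimal sketch of that setup (in Python with NumPy; the variable names are mine for illustration, this is not the book's code):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Initialisation: draw the "true" action values once and keep them fixed.
n_arms = 10
q_star = rng.normal(loc=0.0, scale=1.0, size=n_arms)   # q_*(a) ~ N(0, 1)

def reward(action: int) -> float:
    """Evaluation: R_t = q_*(A_t) plus fresh zero-mean, unit-variance noise."""
    return q_star[action] + rng.normal(loc=0.0, scale=1.0)

# Pulling the same arm repeatedly gives samples scattered around its true value.
samples = [reward(3) for _ in range(5)]
print(q_star[3], samples)
```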

Why is the reward function chosen this way in the testbed, and how does the reward function affect the value estimates and the graphs?

For a bandit problem to be non-trivial, the reward function needs to be stochastic, so that it is not possible to immediately discover the best action; there should be some uncertainty about which action is best, even after taking many samples.

So the noise is there to provide at least some difficulty - without it, finding the best action would be a trivial $\arg\max$ over the 10 possible actions. The noise does not represent uncertainty in the sensing (although that could also be a real-world issue), but variability of the environment in response to an action. The test examples could have almost any reward distribution (e.g. $p(-1.0 \mid a=1) = 0.9,\; p(9.0 \mid a=1) = 0.1$ gives $q_*(a=1) = 0.0$); the authors made a choice that was concise to describe and useful for exploring the different techniques in the chapter.
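As a quick illustration of that last point (again just a sketch, not anything from the book), here are two very different reward distributions that share the same true value $q_*(a) = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)

# The testbed's Gaussian reward around q_*(a) = 0 ...
gaussian_rewards = rng.normal(loc=0.0, scale=1.0, size=100_000)

# ... and the skewed two-point distribution from the text: -1.0 with
# probability 0.9, +9.0 with probability 0.1, which also has mean 0.
two_point_rewards = rng.choice([-1.0, 9.0], p=[0.9, 0.1], size=100_000)

print(gaussian_rewards.mean())   # close to 0.0
print(two_point_rewards.mean())  # also close to 0.0
```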

The specific reward function will affect the learning graphs. The testbed has been chosen so that the ratio of noise to the magnitude of the "true" values is high. In turn, this means that value estimates converge relatively slowly (relative to the spread of the true values), and this exposes the differences between sampling and estimation techniques when average performance is plotted against time step.
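To see that effect, here is a rough simulation sketch (assuming sample-average value estimates and standard ε-greedy action selection; it uses fewer runs than the book's 2000 to keep it quick):

```python
import numpy as np

def run_testbed(epsilon: float, steps: int = 1000, runs: int = 200, seed: int = 0):
    """Average reward per time step for a sample-average epsilon-greedy learner
    on randomly generated 10-armed bandit tasks (a sketch of the experiment)."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, size=10)      # fixed true values for this task
        q_est = np.zeros(10)                        # incremental sample averages
        counts = np.zeros(10)
        for t in range(steps):
            if rng.random() < epsilon:
                a = rng.integers(10)                # explore
            else:
                a = int(np.argmax(q_est))           # exploit current estimates
            r = q_star[a] + rng.normal(0.0, 1.0)    # noisy reward
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]  # update sample-average estimate
            avg_reward[t] += r
    return avg_reward / runs

greedy = run_testbed(epsilon=0.0)
eps_greedy = run_testbed(epsilon=0.1)
# The epsilon-greedy curve typically ends up noticeably higher, as in the book's plots.
print(greedy[-100:].mean(), eps_greedy[-100:].mean())
```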

To answer your concern:

I guess the reason I'm asking this question is because I'm not completely clear on what a reward looks like in the real world, where I don't know anything about $q_*(a)$.

In the real world you may need to sense or receive the reward from the environment. That would complicate the test scenario without adding anything to the understanding of the maths, so the testbed simply generates an imaginary distribution for the environment inside the problem itself. The "sensing" in the test is just assumed, with the reward amount defined by the test.

To qualify as a simple (and static) bandit problem, the reward has to be immediately apparent on taking the action, and have no dependency on current state or history. That constrains the problem somewhat - it is not the full reinforcement learning problem. So real-world examples tend to involve gambling-like choices among a limited set of independent, repeatable events.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange