Question

When explaining the advantage function, it is usually claimed that using a baseline reduces the variance, but I have not found any specific reference justifying this.

Is this an application of control variates or something similar?

Could anyone provide some reference or formal justification for the variance reduction?


Solution

I assume that you are referring to policy gradient estimates. First of all, subtracting any function that depends only on the state of the environment (a baseline) from the action-value term does not bias your gradient estimator (Proof Here).
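Concretely, the unbiasedness follows from the score-function identity. With $\pi_\theta(a \mid s)$ denoting the policy and $b(s)$ any state-dependent baseline (notation mine, not from the original answer), the baseline term vanishes in expectation:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0.$$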

The basic idea of subtracting a baseline from your action-value function (thus forming the advantage function) is that an unbiased estimator of the policy gradient remains unbiased when a constant (or, more generally, any state-dependent function) is subtracted, and that constant can then be chosen, by solving a small optimization problem, to minimize the variance of the new estimator. If you have access, you can find a very good explanation in the book Statistical Reinforcement Learning: Modern Machine Learning Approaches, section 7.2.2; see also [2] and section 3 of [3].
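To sketch that optimization step: writing $g = \nabla_\theta \log \pi_\theta(a \mid s)$ and treating one gradient component at a time, subtracting a constant $b$ leaves the mean unchanged, so minimizing the variance amounts to minimizing $\mathbb{E}\big[g^2 (Q(s, a) - b)^2\big]$, a quadratic in $b$ whose minimizer is

$$b^* = \frac{\mathbb{E}\big[g^2\, Q(s, a)\big]}{\mathbb{E}\big[g^2\big]},$$

i.e. a score-weighted average of the action values (this is the standard result; the references may weight by the squared norm of the full gradient vector instead of per component).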

As you mention, this can indeed be viewed as an application of control variates [4], a standard technique for reducing variance in Monte Carlo estimation. A good choice of baseline is the usual state value function $V(s)$, which turns $Q(s, a)$ into the advantage $A(s, a) = Q(s, a) - V(s)$ and reduces the variance of your estimate.
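To make the variance reduction concrete, here is a minimal numerical sketch (mine, not from the original answer): a two-armed bandit with a softmax policy, comparing single-sample REINFORCE gradient estimates with no baseline against the same estimates using $V(s)$ as the baseline. The means agree (unbiasedness) while the per-component variance drops:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a single state, two actions, softmax policy over preferences theta.
theta = np.array([0.5, -0.5])
probs = np.exp(theta) / np.exp(theta).sum()
q = np.array([1.0, 3.0])   # true action values Q(s, a)  (made-up numbers)
v = probs @ q              # state value V(s) = E_{a~pi}[Q(s, a)]

def grad_log_pi(a):
    """Score of a softmax policy: one-hot(a) - probs."""
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    return one_hot - probs

def pg_samples(n, baseline):
    """n single-sample REINFORCE gradient estimates using the given baseline."""
    grads = np.empty((n, len(theta)))
    for i in range(n):
        a = rng.choice(len(probs), p=probs)
        ret = q[a] + rng.normal(scale=1.0)   # noisy return centered on Q(s, a)
        grads[i] = grad_log_pi(a) * (ret - baseline)
    return grads

no_base = pg_samples(100_000, baseline=0.0)
with_base = pg_samples(100_000, baseline=v)

print("mean (no baseline):", no_base.mean(axis=0))    # the two means agree ...
print("mean (V baseline): ", with_base.mean(axis=0))  # ... the estimator stays unbiased
print("var  (no baseline):", no_base.var(axis=0))
print("var  (V baseline): ", with_base.var(axis=0))   # variance drops with the baseline
```

On this toy setup the baseline cuts the variance of each gradient component by roughly a factor of four; the gap grows as the action values move further from zero.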

Hope it helps!
