Question

I have been reading this excellent post: https://medium.com/@jonathan_hui/rl-policy-gradients-explained-9b13b688b146 and following the RL videos by David Silver, and there is one thing I did not get:

For $\pi_\theta(\tau) = \pi_\theta(s_1, a_1, ..., s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t | s_t) p(s_{t+1}|a_t, s_t)$ being the likelihood of a given trajectory in an episode, the gradient of the objective becomes $$\nabla_{\theta}J = E[\nabla_{\theta} \log \pi_{\theta}(\tau) \cdot r]$$

which then immediately becomes

$$= \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) r$$

i.e. averaged over $N$ sampled paths $\tau_i$, while I expected

$$= \sum_\tau \pi_\theta(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, r$$

What I do not get: where does the probability of the trajectories $\pi_\theta(\tau)$ (the left-most sum) go, or why did it get replaced by the mean over all paths? Is it assumed that all trajectories are equally likely, given that you start from a known starting position?

(You can find the equations in the blog post linked above, at the end of the chapter "Optimization", right before the chapter "Intuition".)

Solution

Actually, in the article there is a $\approx$ rather than an $=$. This is because you can approximate expected values by sampling from the respective distribution.

Assume you want to compute

$$ E \left[ f(x) \right] = \int p(x) f(x) dx$$

The integral might be intractable, and you might not even know the distribution $p(x)$ in closed form. But as long as you can sample from $p(x)$, you can approximate it quite well using the Monte Carlo estimator

$$ E \left[ f(x) \right] \approx \frac{1}{N} \sum_{i=1}^N f(x_i)$$

with $x_i \sim p(x)$ for all $i$. The approximation gets better as $N$ grows. The distribution $p(x)$ is, in some sense, represented by the samples $x_i$ and their relative frequencies.
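As a minimal numerical sketch (the choice of $p(x)$ as a standard normal and $f(x) = x^2$ is made up for illustration; the exact value is $E[x^2] = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2  # exact value of E[f(x)] for x ~ N(0, 1) is 1

for N in (10, 1_000, 100_000):
    xs = rng.standard_normal(N)   # samples x_i ~ p(x)
    print(N, f(xs).mean())        # (1/N) * sum_i f(x_i), approaches 1 as N grows
```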

This is what is going on in the article. You want to compute the expectation over all possible trajectories, but that is infeasible. Luckily, you can sample from the distribution by running simulations of the environment, and the expectation is then approximated using the sampled trajectories. The probability $\pi_\theta(\tau)$ is represented implicitly in the sampled $a_{i,t}$ and $s_{i,t}$: likely trajectories simply occur more often among the samples, so no explicit weighting is needed, and the trajectories are not assumed to be equally likely. Schematically, $$\sum_\tau \pi_\theta(\tau) f(\tau) = E[f(\tau)] \approx \frac{1}{N} \sum_{i=1}^N f(\tau_i) \quad \text{with} \quad \tau_i \sim \pi_\theta(\tau).$$

In short, the expectation over all possible paths is approximated by the mean over sampled paths.
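A minimal sketch of this sampling scheme, assuming a made-up toy MDP (all sizes, transition probabilities, and rewards below are invented purely for illustration): running the current policy generates trajectories with frequencies governed by $\pi_\theta(\tau)$, and averaging $\left(\sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) r$ over them yields exactly the estimator from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP, invented for illustration: 3 states, 2 actions, horizon 5.
n_states, n_actions, T = 3, 2, 5
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # reward r(s,a)

theta = np.zeros((n_states, n_actions))  # policy parameters

def policy(s):
    """Softmax policy pi_theta(.|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """For a softmax policy, grad_theta log pi(a|s) is onehot(a) - pi(.|s) in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# Monte Carlo policy gradient: sample N trajectories by running the policy.
# pi_theta(tau) never appears explicitly -- likely trajectories just come up
# more often, which is how the 1/N average replaces the weighted sum over tau.
N = 1000
grad_estimate = np.zeros_like(theta)
for _ in range(N):
    s = 0                                    # known starting state s_1
    ret, score = 0.0, np.zeros_like(theta)
    for _ in range(T):
        a = rng.choice(n_actions, p=policy(s))
        score += grad_log_pi(s, a)           # sum_t grad log pi(a_t|s_t)
        ret += R[s, a]                       # accumulate reward
        s = rng.choice(n_states, p=P[s, a])  # environment transition
    grad_estimate += score * ret
grad_estimate /= N  # (1/N) sum_i (sum_t grad log pi(a_{i,t}|s_{i,t})) * r
print(grad_estimate)
```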

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange