Question

Assume we work with neural networks, trained with the policy gradient method. The gradient of the objective function $J$ with respect to the parameters $\theta$ is an expectation.

In other words, to get this gradient $\nabla_{\theta} J(\theta)$, we sample $N$ trajectories and average their gradient contributions to obtain a more precise value that can "begin flowing into our network" during backprop. $$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \right) \left( \sum_{t=1}^T r(s_{i,t}, a_{i,t}) \right)$$

Looking in more detail at a specific trajectory $i$, we sum the gradient at each timestep of this trajectory, then multiply by the total reward obtained from running that trajectory.
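
To make sure I read the estimator right, here is a rough NumPy sketch of it (all names, shapes and values below are placeholders I made up, not anything specific):

```python
import numpy as np

# Placeholder sizes: N sampled trajectories, T timesteps each,
# and a flattened parameter vector theta of size P.
N, T, P = 4, 5, 3
rng = np.random.default_rng(0)

# Stand-ins for grad_theta log pi(a_{i,t} | s_{i,t}) at every timestep, shape (N, T, P);
# in practice these come from backprop through the policy network.
log_prob_grads = rng.normal(size=(N, T, P))

# Stand-ins for the rewards r(s_{i,t}, a_{i,t}), shape (N, T).
rewards = rng.normal(size=(N, T))

# Per trajectory: (sum over t of the score terms) * (total reward of the trajectory),
# then average over the N trajectories, as in the formula above.
per_trajectory = log_prob_grads.sum(axis=1) * rewards.sum(axis=1, keepdims=True)
grad_J = per_trajectory.mean(axis=0)
print(grad_J.shape)  # (P,)
```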

Assuming we use a softmax as the network's final layer, the value output for each action belongs to the range $[0,1]$.
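
As a quick sanity check (the logits here are made up, chosen so the output matches the example further down):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([np.log(5.0), 0.0, 0.0, 0.0, 0.0, 0.0])  # made-up scores
probs = softmax(logits)
print(probs)        # [0.5 0.1 0.1 0.1 0.1 0.1] -- every entry lies in [0, 1]
print(probs.sum())  # 1.0
```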

If I remember correctly, the gradient w.r.t. $\theta$ at each timestep of a trajectory is going to be: $$\nabla_{\theta}\log \pi_{\theta}(a_{t}|s_{t}) = \frac{1}{\pi_{\theta}(a_{t}|s_{t})} \nabla_{\theta} \pi_{\theta}(a_{t}|s_{t})$$

where $\nabla_{\theta} \pi_{\theta}(a_{t}|s_{t})$ is then simply the derivative of the softmax with respect to its inputs, chained through the rest of the backprop down to the weights $\theta$.
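
To convince myself of that identity, here is a small finite-difference check at the level of the softmax inputs (the logits and the chosen action index below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def numerical_grad(f, z, eps=1e-6):
    """Central-difference gradient of a scalar function f at z."""
    return np.array([
        (f(z + eps * np.eye(len(z))[k]) - f(z - eps * np.eye(len(z))[k])) / (2 * eps)
        for k in range(len(z))
    ])

z = np.array([1.6, 0.2, -0.3, 0.5, 0.0, -1.0])  # made-up softmax inputs (logits)
a = 0                                            # index of the action that was taken

pi_a = softmax(z)[a]
grad_pi_a = numerical_grad(lambda v: softmax(v)[a], z)
grad_log_pi_a = numerical_grad(lambda v: np.log(softmax(v)[a]), z)

# The identity from above:  grad log pi(a|s) == (1 / pi(a|s)) * grad pi(a|s)
print(np.allclose(grad_log_pi_a, grad_pi_a / pi_a, atol=1e-5))  # True
```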

Question:

Let's say at timestep $t_2$ our softmax has output something like this:

[0.5,  0.1,  0.1,  0.1,  0.1,  0.1]

and the total reward for the trajectory was 50

Ignoring the rest of the backprop at the current timestep, $\frac{1}{\pi_{\theta}(a_{t}|s_{t})}$ will give us:

[2,  10,  10,  10,  10,  10] 

This means we are already favoring the other actions instead of the first action. This seems counter-intuitive to me. What if we only got such a large reward because we took the first action? But the formula encourages us to strengthen the other five actions.
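
(For reference, the numbers above come from this quick computation, using the softmax output and the reward of 50 from the example:)

```python
import numpy as np

probs = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # softmax output at timestep t_2
print(1.0 / probs)         # [ 2. 10. 10. 10. 10. 10.]
print(50 * (1.0 / probs))  # the same values scaled by the trajectory reward of 50
```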

