Question

I am trying to build a policy gradient RL agent. Let's look at REINFORCE's equation for updating the model parameters by gradient ascent (I apologize if the notation is slightly non-conventional):

$$\omega = \omega + \alpha \cdot \nabla_\omega \log \pi(A_t \mid S_t) \cdot V_t$$
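
For concreteness, here is a minimal sketch of how I understand that update for a single time step (assuming a small PyTorch policy network; the state, action, and $V_t$ values are made-up placeholders):

```python
import torch
import torch.nn as nn

# Toy policy for illustration: maps a 4-dim state to probabilities over 2 actions.
policy = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 2), nn.Softmax(dim=-1),
)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)  # lr plays the role of alpha

state = torch.randn(4)      # S_t (placeholder)
action = torch.tensor(1)    # A_t, the action that was actually taken
v_t = 1.7                   # V_t, the weight for this time step (placeholder)

probs = policy(state)                 # pi(. | S_t)
log_prob = torch.log(probs[action])   # log pi(A_t | S_t)

# Gradient ascent on log pi * V_t is gradient descent on its negation.
loss = -log_prob * v_t
optimizer.zero_grad()
loss.backward()
optimizer.step()   # omega <- omega + alpha * grad_omega log pi(A_t|S_t) * V_t
```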

The questions I am unsure about are the following:

  1. Do I calculate the gradient at each time step $t$ (in SGD fashion), or is averaging the gradient over all time steps of the episode a better option?
  2. Do I care about the gradient of the selected action's probability output only (ignoring the outputs for the other actions, in the discrete case)? In other words, do I consider the $V_t$ term for non-selected actions to be 0, which makes their gradient values equal to 0 as well?
  3. In the discrete case the cross-entropy (the loss) is defined as: $$H(p,q) = -\sum_x p(x) \log q(x)$$

    (source: Wikipedia)

    Does that mean that if I substitute the labels (denoted as $p(x)$) with the $V_t$ terms (non-zero for the selected action only) in my neural network training, I will get the correct gradient values of the log-loss that fully satisfy the REINFORCE definition? (See the sketch after this list for what I mean.)
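
To make question 3 concrete, here is the substitution I have in mind, written as a tiny sketch (the probability values, action index, and $V_t$ are made-up placeholders):

```python
import torch

# pi(.|S_t) as produced by the network's softmax output (placeholder values).
probs = torch.tensor([0.3, 0.7], requires_grad=True)
action, v_t = 1, 1.7

# Pseudo-labels p(x): zero everywhere except V_t at the selected action.
target = torch.zeros_like(probs)
target[action] = v_t

# Cross-entropy with those pseudo-labels: H = -sum_x p(x) * log q(x)
loss = -(target * torch.log(probs)).sum()
loss.backward()

# Only the selected action gets a non-zero gradient, -V_t / pi(A_t|S_t),
# which is the gradient of -V_t * log pi(A_t|S_t) w.r.t. the network output.
print(probs.grad)
```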

No correct solution
