Question

I was working with RL for continuous systems and naturally stumbled across the policy gradient.

What I want to know is: is this something like a cost function for RL? It gives that impression, since we are measuring how efficient the system is as a whole (a weighted sum of rewards, multiplied by quantities derived from the policy).

Let's take the example of vanilla PG:

$$g = \mathbb{E}\Big[\sum_t R_t \, \frac{\partial}{\partial\theta} \ln \pi_\theta(a_t \mid s_t)\Big]$$

Here, the gradient is nothing but the expected value of the return (which is just the discounted sum of all the rewards) multiplied by how the policy (the network output) changes with respect to the network weights.
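To be explicit about what I mean by the return (writing $\gamma$ for the usual discount factor and $r$ for the per-step reward):

$$R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}$$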

This seems similar to a cost function in supervised learning, where we compute the total error with cross-entropy (playing a role similar to the information carried by the return) and then use backpropagation to see how the weights of the neural network should change.
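To show what I mean, here is a minimal sketch of that analogy, assuming PyTorch and a made-up discrete-action policy (the network shape, episode data, and returns below are all hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy policy: 4-dim state -> logits over 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 2))

# Stand-in episode data: states s_t, actions a_t (sampled earlier), returns R_t.
states = torch.randn(5, 4)
actions = torch.randint(0, 2, (5,))
returns = torch.tensor([4.0, 3.2, 2.5, 1.8, 1.0])

# log pi_theta(a_t | s_t): the same term cross-entropy uses,
# but weighted by the return instead of a 0/1 label.
log_probs = torch.log_softmax(policy(states), dim=-1)
chosen = log_probs[torch.arange(5), actions]

# Surrogate "loss": the returns are treated as constants, so the gradient
# of -loss is a sample estimate of
# g = E[ sum_t R_t * d/dtheta log pi_theta(a_t|s_t) ].
loss = -(returns * chosen).sum()
loss.backward()  # gradients now sit in the policy's .grad fields
```

So minimizing this weighted negative log-likelihood with backprop ascends the expected return, which is what makes it feel like a cost function to me.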

Let me know if I've got this right.
