Question

After much research, I still can't find a clear answer to this question:

Let's assume 'lo' is our loss for a state-action pair, calculated with the Bellman equation. I don't understand which of these is correct:

  1. Should I backpropagate the same loss for every output Q(s, a) in my network?

  2. Should I ONLY backpropagate the loss for the specific output neuron whose action I chose (and not the rest of the output neurons)? Meaning that if we choose action 3 out of, say, 10 possible actions, we only backpropagate from output neuron 3.

  3. Should I calculate, for every Q(s_n, a_n), its Q*(s_n, a_n) and backpropagate the loss between these two each time? This is not correct, as far as I understood.

Thanks for helping me out!


Solution

I think you're over-thinking this. Your Q network is simply a function approximator that you're using for regression. For a transition (s, a, r, s'), your prediction is Q(s, a) and your target (label) is r + gamma * max Q'(s', a'), where the max is taken over the actions a' and Q' is your target Q network. You compute the loss between that prediction and the target and simply backpropagate. Assuming you're using an autograd library, you don't need to worry about much more than that. If you do want to know what the gradients look like, remember that the prediction Q(s, a) depends on only one of the output neurons (the one corresponding to a), so gradients only flow along paths that pass through that neuron. In other words, your option 2 is the right picture.
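If it helps to see this concretely, here is a minimal PyTorch-style sketch (all names such as q_net, target_net, and the batch tensors are illustrative, not from the question) of computing the loss only for the chosen action, so that gradients flow through that single output neuron:

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, .) for the whole batch: shape (batch, num_actions)
    q_values = q_net(states)
    # Keep only the Q-value of the action actually taken -> shape (batch,)
    q_sa = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Target: r + gamma * max_a' Q'(s', a'), from the frozen target network
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    # Regression loss between the prediction Q(s, a) and the bootstrapped target.
    # Because only the chosen action's output contributes to q_sa, gradients
    # flow only through that output neuron (shared hidden layers still receive
    # gradient from it).
    return nn.functional.mse_loss(q_sa, target)
```

The `gather` call is what implements "only backpropagate from the chosen output neuron": the other action outputs never enter the loss, so they receive no gradient.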

Licensed under: CC-BY-SA with attribution