Question

I am currently trying to understand how TD-Gammon works and have two questions:

1) I found an article which explains the weight update. It consists of three parts. The last part is the derivative of V(s) with respect to w. In the text it is called a "running sum". How do I calculate that value? (I'm only interested in the weight changes from the output layer to the hidden layer, not in the earlier layers.)

2) After reading this weight-update procedure, one question arose: why don't we just create a target value for a state using reinforcement learning and give that value to our neural network, so that it learns to return that value for the current state? Why is there an extra update rule that directly manipulates the weights?


Solution

Really, you just need to implement an ANN that uses the basic, usual sum-of-squares error. Then, replace the (target - output) term with the TD error: E = r + gamma*V(t+1) - V(t)
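For concreteness, here is that error value computed with some made-up numbers (the reward, discount factor, and value estimates below are purely illustrative):

```python
gamma = 0.9      # discount factor (illustrative value)
r = 0.0          # reward for this transition; in backgammon it is 0 until the game ends
V_t = 0.55       # the network's value estimate for the current state s_t
V_t_next = 0.60  # the network's value estimate for the next state s_{t+1}

# This quantity takes the place of (target - output) in the squared-error gradient.
E = r + gamma * V_t_next - V_t
print(E)  # about -0.01: the current estimate was slightly too high
```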

From there, you can just use the typical ANN backprop weight update rule.
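As a rough sketch of what that update can look like for the hidden-to-output weights only (the part question 1 asks about), here is a minimal NumPy version. The network sizes, learning rate, and trace-decay parameter are my own illustrative choices, and I'm assuming the "running sum" in your article is the eligibility trace of TD(lambda); setting lam = 0 gives the plain one-step update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (not taken from the article).
n_inputs, n_hidden = 198, 40
W_ih = rng.normal(scale=0.1, size=(n_hidden, n_inputs))  # input -> hidden weights
W_ho = rng.normal(scale=0.1, size=(1, n_hidden))         # hidden -> output weights
alpha, gamma, lam = 0.1, 1.0, 0.7                        # learning rate, discount, trace decay
e_ho = np.zeros_like(W_ho)                               # the "running sum" (eligibility trace)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    h = sigmoid(W_ih @ x)        # hidden activations
    v = sigmoid(W_ho @ h)[0]     # scalar value estimate V(s)
    return h, v

def td_step(x_t, x_next, r):
    """One TD update for the hidden->output weights only."""
    global W_ho, e_ho
    h_t, v_t = forward(x_t)
    _, v_next = forward(x_next)
    td_error = r + gamma * v_next - v_t   # plays the role of (target - output)
    # Gradient of V(s_t) with respect to the hidden->output weights (sigmoid output unit):
    grad = v_t * (1.0 - v_t) * h_t        # shape (n_hidden,)
    # Decay the old trace and add the new gradient: this is the running sum.
    e_ho = gamma * lam * e_ho + grad.reshape(1, -1)
    W_ho += alpha * td_error * e_ho
    return td_error

# Random vectors standing in for two consecutive board encodings:
x_t, x_next = rng.random(n_inputs), rng.random(n_inputs)
print(td_step(x_t, x_next, r=0.0))
```

With lam > 0 the trace keeps a decayed sum of all past gradients, so each TD error also nudges the weights that contributed to earlier predictions; with lam = 0 it collapses to the ordinary backprop step described above.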

So, in short, I think your description is actually what an RL-via-ANN algorithm should do. It is training the ANN to learn the state/action value function.
