Question

I have an artificial neural network which plays Tic-Tac-Toe - but it is not complete yet.


What I have so far:

  • the reward array "R[t]" with an integer value for every time step or move "t" (1 = player A wins, 0 = draw, -1 = player B wins)
  • the input values are correctly propagated through the network
  • the formula for adjusting the weights:

*(image: weight-adjustment formula)*


What is missing:

  • the TD learning: I still need a procedure which "backpropagates" the network's errors using the TD(λ) algorithm.

But I don't really understand this algorithm.


My approach so far ...

The trace-decay parameter λ should be 0.1, since distant states should not receive that much of the reward.

The learning rate is 0.5 in both layers (input and hidden).

It's a case of delayed reward: The reward remains "0" until the game ends. Then the reward becomes "1" for the first player's win, "-1" for the second player's win or "0" in case of a draw.
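For illustration, the delayed-reward scheme described above might be built like this (a minimal sketch; the `rewards` helper and the `winner` encoding are my own illustration, not from the original code):

```python
def rewards(num_moves, winner):
    """Build the reward array R[t]: 0 for every move until the game
    ends, then 1 if player A won, -1 if player B won, 0 on a draw."""
    final = {"A": 1, "B": -1, None: 0}[winner]
    return [0] * (num_moves - 1) + [final]
```

The point is simply that every entry of R[t] is zero except the last, which is why a plain supervised target is not available for the intermediate moves.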


My questions:

  • How and when do you calculate the net's error (TD error)?
  • How can you implement the "backpropagation" of the error?
  • How are the weights adjusted using TD(λ)?

Thank you so much in advance :)


Solution

If you're serious about making this work, then understanding TD(λ) will be very helpful. Sutton and Barto's book, "Reinforcement Learning: An Introduction", is available for free in HTML format and covers the algorithm in detail. Basically, what TD(λ) does is learn a mapping between a game state and the expected reward at the game's end. As games are played, states that are more likely to lead to winning positions tend to accumulate higher expected-reward values.
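As a rough sketch of what that means in code, here is one TD(λ) update over a tabular value function stored in a dict (the function name, the accumulating-trace style, and the `alpha`/`gamma`/`lam` parameters are all my own assumptions, not from the original post):

```python
def td_lambda_step(V, e, s, s_next, reward, alpha=0.5, gamma=1.0, lam=0.1):
    """One TD(lambda) update for the transition s -> s_next.

    V maps each state to its expected final reward; e holds the
    eligibility traces.  The TD error compares the reward plus the
    next state's value against the current state's value, then every
    recently visited state is nudged in proportion to its trace."""
    delta = reward + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # TD error
    e[s] = e.get(s, 0.0) + 1.0                # accumulating trace for s
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]
        e[state] *= gamma * lam               # decay all traces
    return delta
```

Note how, with λ = 0.1, a state visited one step earlier receives only a tenth of the correction that the current state receives.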

For a simple game like tic-tac-toe, you're better off starting with a tabular mapping (just track an expected reward value for every possible game state). Once that works, you can swap in a NN for the mapping. But I would suggest trying a separate, simpler NN project first...
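A minimal sketch of that tabular starting point, assuming hashable board states and plain TD(0) updates applied once per finished game (`train_episode` and its parameters are illustrative, not a reference implementation):

```python
def train_episode(V, states, final_reward, alpha=0.1):
    """Tabular TD(0) over one finished game.

    'states' is the sequence of board positions seen during the game,
    and final_reward (+1 / 0 / -1) arrives only on the last transition.
    V maps a hashable board state to its expected final reward."""
    for t in range(len(states) - 1):
        # Intermediate moves: pull each state's value toward its successor's.
        V[states[t]] = V.get(states[t], 0.0) + alpha * (
            V.get(states[t + 1], 0.0) - V.get(states[t], 0.0))
    # Terminal state: pull its value toward the actual game outcome.
    last = states[-1]
    V[last] = V.get(last, 0.0) + alpha * (final_reward - V.get(last, 0.0))
```

Running this over many self-play games lets the reward seep backward: first the terminal states pick up nonzero values, then the states that lead to them, and so on.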

OTHER TIPS

I have been confused about this too, but I believe this is the way it works:

Starting from the end node, you compare R (the output received) with E (the output expected). If E = R, everything is fine and there are no changes to make.

If E != R, you measure how far off it was, based on the thresholds and so on, and then shift the weights or thresholds up or down a bit. Then, based on the new weights, you go back a layer, judge whether the result was too high or too low, and repeat with a weaker effect.

I've never really tried this algorithm, but that's basically the version of the idea as I understand it.

As far as I remember, you do the training with a known result set: you calculate the output for a known input and subtract the known output value from it - that difference is the error.

Then you use the error to correct the net. For a single-layer NN adjusted with the delta rule, I know that an epsilon of 0.5 is too high - something like 0.1 is better: slower, but better. With backpropagation it is a bit more involved, but although the mathematical description of a NN looks complex and hard to understand, it is not that complicated.
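For illustration, a single delta-rule step with epsilon = 0.1 might look like this (a sketch; `delta_rule_step` is a made-up helper, and the linear output omits any activation function):

```python
def delta_rule_step(weights, x, target, epsilon=0.1):
    """One delta-rule update for a single linear unit.

    The output is the weighted sum of the inputs, the error is
    (target - output), and each weight moves by epsilon * error * input."""
    output = sum(w * xi for w, xi in zip(weights, x))
    error = target - output
    new_weights = [w + epsilon * error * xi for w, xi in zip(weights, x)]
    return new_weights, error
```

Repeating this step shrinks the error a little each time, which is why a small epsilon converges slowly but reliably while a large one can overshoot.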

Take a look at http://www.codeproject.com/KB/recipes/BP.aspx, or google "backpropagation c" - the algorithm is probably easier to understand in code.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow