Question

I want to ask whether it makes sense to use a standard backpropagation neural network with the TD-learning method for a board game.

My method looks like:

  1. Play 1 game. The net plays as both players, following a greedy policy with occasional random moves.
  2. For each stored game position (starting from the position before the terminal one and moving back to the starting position), calculate the estimated position value and the desired position value, e.g.

    boards_values[i]['desired_value'] = boards_values[i]['estimated_value'] + 0.4 * ( boards_values[i+1]['estimated_value'] - boards_values[i]['estimated_value'] )
    
  3. Create training patterns for the net from the entire game and train on each of them with a small learning rate for 1 epoch, using the standard backpropagation algorithm (a minimal sketch of steps 2-3 follows below).

    NN.train([pattern], iterations=1, N=0.001, M=0.000001)
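
In code, steps 2-3 look roughly like this (a simplified sketch, not my exact implementation; the 'inputs' key and storing the game result as the terminal position's value are assumptions of the sketch):

    ALPHA = 0.4            # TD step size from step 2
    LEARNING_RATE = 0.001  # the small learning rate from step 3

    def td_update_and_train(nn, boards_values):
        """Walk the stored positions backwards (terminal-1 .. start),
        compute the TD(0) target for each one and immediately train on
        it for one epoch, as in steps 2-3 above.

        Assumed (not a real API): nn exposes train(patterns, iterations,
        N, M) as in the question, and every entry of boards_values holds
        'inputs' and 'estimated_value', with the last entry's value set
        to the actual game result.
        """
        for i in range(len(boards_values) - 2, -1, -1):
            v_now = boards_values[i]['estimated_value']
            v_next = boards_values[i + 1]['estimated_value']
            # TD(0) target: move the current estimate towards the next one
            boards_values[i]['desired_value'] = v_now + ALPHA * (v_next - v_now)

            pattern = (boards_values[i]['inputs'],
                       [boards_values[i]['desired_value']])
            nn.train([pattern], iterations=1, N=LEARNING_RATE, M=0.000001)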
    

I tried several variations of the above (training not on a single example but on 30-40 patterns at a time, decreasing/increasing the learning rate, and so on) on my tic-tac-toe game and never trained an ideal player (it should never lose against a random player). One of the best results of the NN agent playing against a random player is:

(play as first: win, tie, lose), (play as second: win, tie, lose), (sum: win, tie, lose)
(191, 34, 275), (159, 102, 239), (350, 136, 514) - fresh net
(427, 21, 52), (312, 16, 172), (739, 37, 224) - after +50k games

The input is 18 neurons, one pair per board cell: (1,0) for x, (0,0) for an empty cell and (0,1) for o. The output is a single unit giving a win/lose estimate in the range [-1, 1].
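
The encoding is produced roughly like this (a simplified sketch; representing the board as a flat list of 9 characters is just for illustration):

    def encode_board(board):
        """Encode a tic-tac-toe board into 18 input values, two per cell:
        (1, 0) for 'x', (0, 1) for 'o', (0, 0) for an empty cell.
        `board` is assumed to be a list of 9 cells containing 'x', 'o' or ' '.
        """
        inputs = []
        for cell in board:
            if cell == 'x':
                inputs.extend([1, 0])
            elif cell == 'o':
                inputs.extend([0, 1])
            else:
                inputs.extend([0, 0])
        return inputs

    # Example: x in the centre, o in a corner
    print(encode_board([' ', ' ', ' ', ' ', 'x', ' ', ' ', ' ', 'o']))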

Tic-tac-toe is only a testing sandbox; once I get it working I will move on to a more complex card game ('Lost Cities').


Solution

Yes, this is relatively standard. It was the approach taken by Tesauro in his program TD-Gammon 2.1, which trained an artificial neural network to play backgammon better than the best human players (after bootstrapping on 1.5 million games).

There are many caveats, however:

  • Artificial neural networks are notoriously difficult to use correctly. Have you ensured that your implementation performs as expected, by testing it on some simple supervised learning problems?

  • TD-Gammon used a neural network to give heuristic utilities for each game state, and combined this with a 2-ply alpha/beta pruning algorithm. With modern computers it is possible to use a deeper look-ahead (for example, I recently coded up an alpha/beta search algorithm that easily manages a 10-ply search on a game with a branching factor of 7, in interpreted (non-compiled) code and before taking heuristics into account). A bare-bones alpha/beta sketch follows this list.

  • TD Learning is not the only reinforcement learning algorithm. I have had success in the past applying SARSA and Q-Learning, which speed up search by preferentially exploring strategies that appear promising and ignoring strategies that look bad. You need to combine them with an exploration policy to ensure that they sometimes explore strategies that look bad, to avoid getting stuck in local minima. A simple policy such as epsilon-greedy with ε = 0.1 often works well; see the epsilon-greedy/Q-Learning sketch after this list.

  • Eligibility traces are a powerful way of speeding up learning in reinforcement learning algorithms. Algorithms that use eligibility traces include TD(λ), SARSA(λ) and Q(λ). Be careful, though: there is now another parameter to fit, which makes it even more important to take care when training your model. Use a test set! A tabular TD(λ) sketch follows this list.
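
For the look-ahead point above, a bare-bones negamax formulation of alpha/beta pruning looks like the following sketch. The State methods (legal_moves, apply, is_terminal) and the evaluate callback (which would wrap the network's output for the player to move) are placeholders, not a concrete API:

    def alphabeta(state, depth, alpha, beta, evaluate):
        """Negamax alpha/beta search.  `evaluate(state)` must return the
        heuristic value (e.g. the net's output) from the point of view
        of the player to move in `state`."""
        if depth == 0 or state.is_terminal():
            return evaluate(state)
        best = float('-inf')
        for move in state.legal_moves():
            value = -alphabeta(state.apply(move), depth - 1,
                               -beta, -alpha, evaluate)
            best = max(best, value)
            alpha = max(alpha, value)
            if alpha >= beta:   # the opponent will never allow this line
                break
        return best

    # Choosing a root move (DEPTH and net_value are placeholders):
    # best_move = max(root.legal_moves(),
    #                 key=lambda m: -alphabeta(root.apply(m), DEPTH - 1,
    #                                          float('-inf'), float('inf'),
    #                                          net_value))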
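
The epsilon-greedy policy and a tabular Q-Learning update are only a few lines each. The plain-dict Q-table and hashable state/action keys are illustrative assumptions; with a neural network you would replace the table lookup with the net's output:

    import random

    EPSILON = 0.1   # exploration rate (the epsilon in epsilon-greedy)
    ALPHA = 0.1     # learning rate
    GAMMA = 0.9     # discount factor

    def epsilon_greedy(q, state, actions):
        """With probability EPSILON pick a random action, otherwise the
        action with the highest Q-value.  `q` is a plain dict keyed by
        (state, action) pairs."""
        if random.random() < EPSILON:
            return random.choice(actions)
        return max(actions, key=lambda a: q.get((state, a), 0.0))

    def q_learning_update(q, state, action, reward, next_state, next_actions):
        """One tabular Q-Learning step:
        Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max((q.get((next_state, a), 0.0) for a in next_actions),
                        default=0.0)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)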
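
Finally, a tabular TD(λ) update with accumulating eligibility traces, to show where the extra λ parameter enters. The state representation and reward convention are again assumptions made for illustration:

    LAMBDA = 0.7   # trace-decay parameter (the extra parameter to tune)
    ALPHA = 0.1    # learning rate
    GAMMA = 1.0    # no discounting for a short episodic game

    def td_lambda_episode(V, states, rewards):
        """One episode of tabular TD(lambda) with accumulating traces.
        `states` is the sequence of visited positions (hashable) and
        rewards[t] is the reward observed on the transition from
        states[t] to states[t+1] (usually 0 until the final transition,
        which carries the game result).  V maps states to their values."""
        traces = {}
        for t in range(len(states) - 1):
            s, s_next = states[t], states[t + 1]
            # TD error for this transition
            delta = rewards[t] + GAMMA * V.get(s_next, 0.0) - V.get(s, 0.0)
            traces[s] = traces.get(s, 0.0) + 1.0        # bump trace for s
            for visited, e in traces.items():           # credit all recent states
                V[visited] = V.get(visited, 0.0) + ALPHA * delta * e
                traces[visited] = GAMMA * LAMBDA * e    # decay the trace
        return V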

Licensed under: CC-BY-SA with attribution