Question

This is another question I have about a Q-learning neural network being used to win tic-tac-toe: I'm not sure I understand when to actually backpropagate through the network.

What I am currently doing is this: as the program plays through games, once the number of recorded game states has reached the maximum size of the replay memory, then every time the program makes a move it picks a random game state from that memory and backpropagates using that state and its reward. This continues on every subsequent move, since the replay memory stays full from that point on.
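To make sure I'm describing it clearly, here is a minimal sketch of that training loop. Everything in it is hypothetical (the buffer size `REPLAY_CAPACITY`, the helper names `store_step` / `train_on_random_sample`, and the assumption of a Keras-style model with `predict`/`fit` over 9 outputs, one per board cell):

```python
import random
from collections import deque

REPLAY_CAPACITY = 10_000                      # hypothetical buffer size
replay_memory = deque(maxlen=REPLAY_CAPACITY)

def store_step(state, action, target_q):
    """Record one (state, action, target Q-value) triple after a game has
    ended and its reward has been propagated back through the moves."""
    replay_memory.append((state, action, target_q))

def train_on_random_sample(model):
    """Called once per move taken: only trains once the buffer is full,
    and backpropagates on a single randomly chosen past step."""
    if len(replay_memory) < REPLAY_CAPACITY:
        return
    state, action, target_q = random.choice(replay_memory)
    q_values = model.predict(state[None, :], verbose=0)  # shape (1, 9) for tic-tac-toe
    q_values[0, action] = target_q                       # only the taken action gets a new target
    model.fit(state[None, :], q_values, epochs=1, verbose=0)
```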

The association between rewards and the game states and actions from the history is made when a game has been completed and the reward has been calculated for each step (i.e. the total reward per step). The method I use to calculate the reward is:

Q(s,a) += reward * gamma^(inverse position in game state)

In this case, gamma is a predetermined value that reduces how much the reward is taken into account the further back you go, and the "inverse position in game state" means that if there have been 5 total moves in a game, the inverse position when adjusting the reward for the first move would be 5, then 4 for the second, 3 for the third, and so on. This simply makes the reward count for less the earlier in the game the move was made. A small sketch of what I mean is below.
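For illustration only (the function name and the gamma value are made up), this is the discounting I'm describing; the resulting values are what gets added to the stored Q(s,a) via the `+=` formula above:

```python
def discounted_rewards(num_moves, final_reward, gamma=0.9):
    """Discounted reward contribution for each move of one finished game.
    The first move gets the smallest share and the last move the largest,
    matching reward * gamma ** (inverse position in game state)."""
    return [final_reward * gamma ** (num_moves - i) for i in range(num_moves)]

# Example: a 5-move game that ended with reward +1 and gamma = 0.9
# -> approximately [0.590, 0.656, 0.729, 0.810, 0.900]
print(discounted_rewards(5, 1.0, 0.9))
```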

Should this allow the program to learn correctly?

