Question

I have been playing with an algorithm that learns how to play tictactoe. The basic pseudocode is:

repeat many thousands of times {
  repeat until game is over {
    if(board layout is unknown or exploring) {
      move randomly
    } else {
      move in location which historically gives highest reward
    }
  }

  for each step in the game {
    determine board layout for current step
    if(board layout is unknown) {
      add board layout to memory
    }
    update reward for board layout based on game outcome
  }
}

now play a human and win :-)
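
Fleshed out a little, the loop might look like this in JavaScript. This is a minimal sketch, assuming hypothetical helpers (newGame, randomMove, bestKnownMove, gameOutcome, explorationRate) rather than the actual identifiers in my code:

var rewards = {}; // serialised board layout -> accumulated reward

for (var episode = 0; episode < 100000; episode++) {
  var game = newGame();
  var visited = [];

  while (!game.isOver()) {
    var exploring = Math.random() < explorationRate(episode);
    if (rewards[game.serialise()] === undefined || exploring) {
      game.play(randomMove(game));             // explore
    } else {
      game.play(bestKnownMove(game, rewards)); // exploit history
    }
    visited.push(game.serialise()); // remember each layout for step 2
  }

  // Step 2: propagate the outcome back over every layout in the game.
  var outcome = gameOutcome(game); // e.g. +10 win, 0 draw, -5 loss
  visited.forEach(function (layout) {
    if (rewards[layout] === undefined) {
      rewards[layout] = 0; // add unknown layout to memory
    }
    rewards[layout] += outcome; // update reward based on game outcome
  });
}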

Exploration: in the beginning the algorithm explores aggressively, and the exploration rate decreases linearly. After, say, a thousand games it only explores on 10% of its moves; all other moves exploit the previously learned rewards.
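
As a sketch, the schedule could be a simple linear ramp with a floor (the constants here are illustrative, not the exact ones I use):

// Linearly decaying exploration rate: starts at 1.0 (always explore)
// and falls to a floor of 0.1 after roughly a thousand games.
function explorationRate(gamesPlayed) {
  var start = 1.0;
  var floor = 0.1;
  var decayGames = 1000;
  var rate = start - (start - floor) * (gamesPlayed / decayGames);
  return Math.max(floor, rate);
}

// explorationRate(0)    -> 1.0
// explorationRate(500)  -> 0.55
// explorationRate(1000) -> 0.1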

Rewards: if the game resulted in a win, award 10 points; a draw, 0 points; otherwise -5 points. These rewards can also be "tuned", so that a shorter won game earns more points and a longer one earns fewer. That way the algorithm learns to win as soon as possible, rather than aiming for a win several moves later. That is important so that it doesn't miss an immediate win - if it missed such a move the opponent would likely a) move there to avoid letting the AI win next time, and b) think the algorithm was stupid because it missed an "obvious" win.
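
The tuning could look something like this; the exact scaling is an assumption for illustration, not my precise formula:

// Base rewards, scaled so that shorter wins earn more points.
function rewardFor(outcome, movesInGame) {
  if (outcome === 'win') {
    // the quickest possible win takes 5 moves in total, the slowest 9,
    // so scale: a 5-move win earns 10 points, a 9-move win about 5.6
    return 10 * (5 / movesInGame);
  }
  if (outcome === 'draw') {
    return 0;
  }
  return -5; // loss
}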

This algorithm does indeed learn, so I can class it as a machine learning algorithm.

I think, but am not sure, that it is a reinforcement learning algorithm. However, according to https://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html it is not temporal difference learning, because it doesn't estimate the rewards until the end of the game, whereas TD learning updates its estimates as it goes along. That might mean that it is not reinforcement learning.
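
For comparison, a temporal difference update (TD(0)) would adjust the value estimate after every single move, using the next state's current estimate rather than waiting for the final outcome. A minimal sketch, where alpha is a learning rate and gamma a discount factor (neither of which my algorithm has):

// TD(0)-style update, applied after every move rather than at game end.
function tdUpdate(V, state, nextState, reward, alpha, gamma) {
  var v = V[state] || 0;
  var vNext = V[nextState] || 0;
  // move the estimate for `state` towards the bootstrapped target
  V[state] = v + alpha * (reward + gamma * vNext - v);
}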

Question 1: Can I successfully argue that I am estimating the reward based on history, and still claim the algorithm is reinforcement learning or even Q-learning?

Question 2: If I replace the reward lookup, which is keyed on the board layout, with a neural network that takes the board layout as input and outputs the reward, could the algorithm be regarded as deep reinforcement learning?
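
In other words, the table read would become a function approximation, roughly like this sketch (encodeBoard and net are hypothetical placeholders, not a real library API):

// current approach: exact lookup of a previously seen layout
function rewardLookup(layout) {
  return rewards[layout];
}

// proposed approach: encode the board as numbers and let a trained
// network estimate the reward, even for layouts never seen before
function rewardEstimate(board) {
  var input = encodeBoard(board); // e.g. 9 cells mapped to -1/0/+1
  return net.predict(input);
}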

Question 3: I don't think that I have either a learning rate or a discount factor. Is that important?

I noticed that the algorithm is pretty useless unless I train it against at least every move an opponent might try. So in a way it feels like it's using brute force rather than really "learning". This makes me question whether machine learning applied to tictactoe is really learning. I agree that using a neural network for image recognition can be classed as learning, because when it sees an unknown image it is still able to state its classification. But that kind of generalisation is quite useless for games like tictactoe, where similar-looking board layouts can be totally unrelated (one may lead to a win, the other to a loss). So...

Question 4: Can tictactoe algorithms be classed as real learning rather than simply brute force?

Update: regarding rewards... when the algorithm is deciding where to move, it works out the reward for each position as follows:

// Sum the reward totals accumulated for this move, then normalise each
// component so that frequently played moves don't win purely by volume.
var total = winRewards + drawRewards + lossRewards;
move.reward = (100 * (winRewards / total)) + (10 * (drawRewards / total)) + (-1 * (lossRewards / total));

I divide by the total number of points accumulated for each move, because otherwise the algorithm decides that one place is GREAT and never gives the others a chance. This way it works out the win ratio regardless of how often a move has been played; each move's score is normalised relative to the others.
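
For example, with made-up numbers, normalisation lets a rarely played but reliable move beat a heavily played mediocre one:

function score(winRewards, drawRewards, lossRewards) {
  var total = winRewards + drawRewards + lossRewards;
  return (100 * (winRewards / total)) + (10 * (drawRewards / total)) + (-1 * (lossRewards / total));
}

score(500, 300, 200); // heavily played, mediocre: -> 52.8
score(8, 2, 0);       // rarely played, strong:    -> 82
// raw sums would have preferred the first move (1000 vs 10 points)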

The code is available here: https://github.com/maxant/tictactoe/blob/master/ai.js

UPDATE #2: I have since figured out that this algorithm cannot be classed as using brute force, because it doesn't actually need to see that many games before becoming an expert. Details here: http://blog.maxant.co.uk/pebble/2018/04/11/1523468336936.html
