Question

I am not a specialist in the subject, and my question is probably very naive. It stems from an attempt to understand the power and limitations of reinforcement learning as used in the AlphaGo program.

The AlphaGo program was built using, among other things (Monte Carlo tree search, etc.), neural networks that are trained on a huge database of human-played Go games and are then improved by having versions of the program play against themselves many times.

Now I wonder what would happen if we tried to build such a program without the human database, i.e. starting with a basic Go program that knows only the rules and some method to explore trees, and letting it play against itself to improve its neural network. Would we, after many games of self-play, arrive at a program able to compete with or beat the best human players? And if so, how many games (in order of magnitude) would be needed? Or, on the contrary, would such a program converge toward a much weaker player?

I assume the experiment has not been done, since AlphaGo is so recent. But the answer may nevertheless be obvious to a specialist. Otherwise any educated guess would interest me.

One can also ask the same question for "simpler" games. If we used roughly the same reinforcement-learning techniques as AlphaGo, but with no human database, for a chess program, would we eventually get a program able to beat the best human? And if so, how fast? Has this been tried? Or if not for chess, what about checkers, or even simpler games?

Thanks a lot.


Solution

I'm no expert, but it looks like AlphaGo Zero answers your question. https://deepmind.com/blog/alphago-zero-learning-scratch/

Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.
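To make "learning simply by playing games against itself" concrete, here is a minimal sketch of an AlphaGo-Zero-style self-play loop. All the helper names (`GoGame`, `run_mcts`, `normalize`, `sample_move`, `train_step`) are placeholders of mine, not DeepMind's actual API, so treat this as an illustration of the idea rather than their implementation:

```python
import random

def self_play_game(network, num_simulations=800):
    """Play one self-play game and return (state, search_policy, outcome) examples."""
    game = GoGame()                                   # placeholder environment
    history = []                                      # (state, search policy, player to move)
    while not game.is_over():
        visit_counts = run_mcts(game, network, num_simulations)  # placeholder MCTS
        search_policy = normalize(visit_counts)       # visit counts -> move probabilities
        history.append((game.state(), search_policy, game.player_to_move()))
        game.play(sample_move(search_policy))         # sample moves for exploration
    winner = game.winner()
    # Each stored position is labelled with the final outcome from that player's perspective.
    return [(s, pi, 1.0 if player == winner else -1.0) for s, pi, player in history]

def training_loop(network, num_iterations):
    replay_buffer = []
    for _ in range(num_iterations):
        replay_buffer.extend(self_play_game(network))
        batch = random.sample(replay_buffer, k=min(2048, len(replay_buffer)))
        # Train the policy head towards the MCTS search probabilities and the
        # value head towards the final game outcomes.
        train_step(network, batch)
```

Starting from a randomly initialised network, this loop is the entire source of training data: no human games enter anywhere.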

OTHER TIPS

The same question was asked of the authors of the AlphaGo paper, and their answer was that we don't know what would happen if AlphaGo learned from scratch (they had not tested it).

However, given the complexity of the game, training an algorithm from scratch without prior knowledge would be a difficult task. Thus, it is reasonable to begin building such a system by bootstrapping it to a master level using knowledge acquired from humans.

It is worth noting that, although the human moves bias the action selection at the tree nodes (states), this prior has a decay factor: the more often a specific state is visited, the weaker the prior becomes, which encourages the algorithm to explore.
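For concreteness, here is a small sketch of the kind of selection score this refers to, in the spirit of the PUCT rule described in the AlphaGo paper (constants and exact normalisation simplified here):

```python
import math

def puct_score(prior, visit_count, parent_visits, mean_value, c_puct=1.0):
    """Selection score for a move at a tree node (PUCT-style sketch).

    The exploration bonus is proportional to the prior for the move, but is
    divided by (1 + visit_count): the more the move is visited, the less the
    prior dominates and the more the empirical mean value matters.
    """
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visit_count)
    return mean_value + exploration

# The bonus from a strong prior (0.9) fades as the move accumulates visits.
for n in (0, 10, 100):
    print(n, round(puct_score(prior=0.9, visit_count=n, parent_visits=1000, mean_value=0.0), 3))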

It is unknown how close AlphaGo's current level of mastery is to a human's way of playing (in the tournament it made one move that a human would have had almost zero probability of playing, but it also made some really bad moves). Possibly all these questions can only be answered by actually implementing the corresponding algorithms and testing them.

I ought to edit my answer, as the recent DeepMind paper answers your question. There were lots of advances that came out of the whole previous experience with the first version of AlphaGo, and it is really worth reading.

As far as I understand the AlphaGo algorithm, it is based on a simple reinforcement learning (RL) framework, using Monte Carlo tree search to select the best actions. On top of that, the states and actions covered by the RL algorithm are not the entire configuration space of the game (Go has enormous complexity); instead they are guided by a policy network and a value network, learned from real games and then improved by having AlphaGo play against itself.
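As a rough illustration of where the two networks enter the search (the interfaces below are placeholders of mine; as I understand the original paper, the leaf value mixes the value network's estimate with the outcome of a fast rollout):

```python
def expand_and_evaluate(node, policy_network, value_network, rollout_policy, lam=0.5):
    """Expand a leaf node and evaluate it with the two learned networks (sketch).

    The policy network provides priors that bias move selection during search;
    the value network (mixed with a fast rollout outcome) replaces searching
    the game all the way to its end.
    """
    priors = policy_network(node.state)               # placeholder: move -> prior probability
    for move, p in priors.items():
        node.add_child(move, prior=p)
    v = value_network(node.state)                     # predicted outcome in [-1, 1]
    z = play_rollout(node.state, rollout_policy)      # fast playout to the end (placeholder)
    return (1 - lam) * v + lam * z                    # value backed up through the tree
```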

We might then wonder whether training on real games is just a shortcut to save time or a necessary step to reach such efficiency. I guess no one really knows the answer, but we can state some assumptions. First, the human ability to find good moves comes from an intelligence much more complex than a simple neural net; for board games it is a mix of memory, experience, logic, and intuition. In this direction, I'm not sure the AlphaGo algorithm could build such a model without explicitly exploring a huge fraction of the entire configuration space of Go (which is practically impossible). Current research focuses on building richer representations of such games, such as relational RL or inductive logic programming. For simpler games (which might be the case for chess, but nothing is certain), I would say that AlphaGo could recover techniques similar to a human's by playing against itself, especially for openings (in chess, for example, there are only 20 possible first moves).
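To put "practically impossible" in perspective, a crude upper bound on the number of 19x19 board configurations (each point empty, black, or white, ignoring legality) already dwarfs any feasible number of self-play games:

```python
# Crude upper bound: 3 possible states per point, 361 points on a 19x19 board.
# The number of *legal* positions is smaller, but still on the order of 10^170.
upper_bound = 3 ** 361
print(f"3^361 is a number with {len(str(upper_bound))} digits (about 10^172)")
```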

Still, this is only an opinion. But I'm quite sure the key to answering your question lies in the RL approach, which is nowadays still quite simple in terms of knowledge. We are not really able to identify what makes us capable of handling these games, and the best way we have found so far to defeat humans is to roughly learn from them and improve the learned model (a bit) with massive computation.

Competitive self-play without a human database is possible even for complicated, partially observed environments. OpenAI is focusing on this direction. According to this article:

Self-play ensures that the environment is always the right difficulty for an AI to improve.

That's an important reason for the success of self-play.
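One simple way to get this "always the right difficulty" behaviour, roughly in the spirit of what OpenAI describes, is to keep a pool of past snapshots of the agent and sample opponents from it, so the current agent always faces opponents near its own strength (all interfaces below are hypothetical placeholders):

```python
import random

def train_with_opponent_pool(agent, num_iterations, snapshot_every=100):
    """Self-play against past versions, so opponents track the agent's own strength."""
    opponent_pool = [agent.clone()]                    # placeholder: frozen copy of the agent
    for step in range(1, num_iterations + 1):
        opponent = random.choice(opponent_pool)        # opponent is a past version of ourselves
        experience = play_match(agent, opponent)       # placeholder: play one game, collect data
        agent.update(experience)                       # one reinforcement-learning update
        if step % snapshot_every == 0:
            opponent_pool.append(agent.clone())        # periodically freeze the current agent
```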

OpenAI achieved superhuman results in Dota 2 1v1: on August 11th, 2017, its bot beat Dendi 2-0 under standard tournament rules.

The bot learned the game from scratch by self-play, and does not use imitation learning or tree search. This is a step towards building AI systems which accomplish well-defined goals in messy, complicated situations involving real humans.

Beyond games, this direction is also promising for robotics tasks.

We’ve found that self-play allows simulated AIs to discover physical skills like tackling, ducking, faking, kicking, catching, and diving for the ball, without explicitly designing an environment with these skills in mind.

As a next step, they are extending the method so that agents learn to cooperate, compete, and communicate, rather than being limited to pure self-play.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange