Question

I want to implement a reinforcement learning Connect Four agent. I am unsure how to do so and what it should look like. I am familiar with the theoretical aspects of reinforcement learning but don't know how they should be implemented.

How should it be done? Should I use TD(lambda) or Q-learning, and how do MinMax trees come into this? How do my Q and V functions work (quality of an action and value of a state), and how do I score those things? What is my base policy which I improve, and what is my model? Another question is how I should store the states or state-action pairs (depending on the learning algorithm). Should I use neural networks or not? And if yes, how?

I am using Java.

Thanks.


Solution

This might be a more difficult problem than you think, and here is why:

The action space for the game is the choice of column to drop a piece into. The state space for the game is an MxN grid. Each column contains up to M pieces distributed between the two players. This means there are (2^(M+1) - 1)^N states. For a standard 6x7 board, this comes out to about 10^15. It follows that you cannot apply reinforcement learning to the problem directly. The state value function is not smooth, so naïve function approximation would not work.
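As a quick sanity check on that bound, here is a small back-of-the-envelope computation in Java. It assumes the standard 6x7 board, and note that the count is an upper bound, since it also counts positions that are unreachable in actual play (e.g. unbalanced piece counts).

    // Back-of-the-envelope count of the state space for a standard 6x7
    // board. Each column independently holds 0..M stacked pieces, each
    // belonging to one of two players, so a single column has
    // 2^(M+1) - 1 possible contents; the whole board has that raised
    // to the power N.
    public class StateSpaceBound {
        public static void main(String[] args) {
            int m = 6;                                    // rows
            int n = 7;                                    // columns
            double perColumn = Math.pow(2, m + 1) - 1;    // 127
            double total = Math.pow(perColumn, n);        // 127^7
            System.out.printf("~%.2e states%n", total);   // ~5.33e+14, i.e. about 10^15
        }
    }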

But not all is lost. For one thing, you could simplify the problem by separating the action space. If you consider the value of each column separately, based on the two columns next to it, you reduce N to 3 and the state space size to 10^6. Now, this is very manageable. You can create an array to represent this value function and update it using a simple RL algorithm, such as SARSA.
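A minimal sketch of what that array plus SARSA could look like in Java. The encoding of a 3-column window into an index and the game loop itself are not shown and would be yours to define; the class name and constants here are illustrative assumptions, and only the epsilon-greedy selection and the SARSA update rule are standard.

    import java.util.Random;

    // Sketch of tabular SARSA over the reduced state space described
    // above: a state is one column plus its two neighbours, encoded
    // as an index in [0, 127^3). The encoding and the environment
    // hooks are omitted; only the table and updates are sketched.
    public class ColumnSarsa {
        static final int COLUMN_STATES = 127;              // 2^(M+1) - 1 for M = 6 rows
        static final int STATES =
            COLUMN_STATES * COLUMN_STATES * COLUMN_STATES; // about 2 * 10^6
        static final int ACTIONS = 7;                      // one per board column

        final double[] q = new double[STATES * ACTIONS];   // flat Q-table, ~115 MB
        final double alpha = 0.1;                          // learning rate
        final double gamma = 0.99;                         // discount factor
        final double epsilon = 0.1;                        // exploration probability
        final Random rng = new Random();

        int index(int state, int action) { return state * ACTIONS + action; }

        // Epsilon-greedy action selection over the current Q estimates.
        int selectAction(int state) {
            if (rng.nextDouble() < epsilon) return rng.nextInt(ACTIONS);
            int best = 0;
            for (int a = 1; a < ACTIONS; a++) {
                if (q[index(state, a)] > q[index(state, best)]) best = a;
            }
            return best;
        }

        // SARSA update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).
        void update(int s, int a, double reward, int sNext, int aNext) {
            double target = reward + gamma * q[index(sNext, aNext)];
            q[index(s, a)] += alpha * (target - q[index(s, a)]);
        }
    }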

Note that the payoff for the game is very delayed, so you might want to use eligibility traces to accelerate learning.
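One possible shape for that in Java: a SARSA(lambda)-style update with replacing traces, stored in a map keyed by the flat (state, action) index from the previous sketch so that only nonzero traces are touched on each step. The 1e-4 cutoff for dropping small traces is an arbitrary choice for this sketch.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Sketch of SARSA(lambda) with replacing eligibility traces, so
    // the delayed end-of-game reward propagates backwards through the
    // states visited during the episode.
    public class TracedSarsa {
        final double alpha = 0.1, gamma = 0.99, lambda = 0.9;
        final double[] q;                                  // shared flat Q-table
        final Map<Integer, Double> traces = new HashMap<>();

        TracedSarsa(double[] q) {
            this.q = q;
        }

        // One learning step; on a terminal transition the caller should
        // bootstrap with 0 instead of q[saNext], and traces.clear()
        // should be called at the start of each new episode.
        void step(int sa, double reward, int saNext) {
            double delta = reward + gamma * q[saNext] - q[sa]; // TD error
            traces.put(sa, 1.0);                               // replacing trace
            Iterator<Map.Entry<Integer, Double>> it = traces.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Integer, Double> e = it.next();
                q[e.getKey()] += alpha * delta * e.getValue(); // credit by trace
                double decayed = e.getValue() * gamma * lambda;
                if (decayed < 1e-4) it.remove();               // drop tiny traces
                else e.setValue(decayed);
            }
        }
    }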

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow