Training a Neural Network with Reinforcement Learning

https://stackoverflow.com/questions/10722064

10-06-2021

Question

I know the basics of feedforward neural networks and how to train them using the backpropagation algorithm, but I'm looking for an algorithm that I can use for training an ANN online with reinforcement learning.

For example, the cart-pole swing-up problem is one I'd like to solve with an ANN. In that case, I don't know what should be done to control the pendulum; I only know how close I am to the ideal position. I need the ANN to learn based on reward and punishment, so supervised learning isn't an option.

Another situation is something like the snake game, where feedback is delayed, and limited to goals and anti-goals, rather than reward.

I can think of some algorithms for the first situation, like hill-climbing or genetic algorithms, but I'm guessing they would both be slow. They might also be applicable in the second scenario, but incredibly slow, and not conducive to online learning.

My question is simple: Is there a simple algorithm for training an artificial neural network with reinforcement learning? I'm mainly interested in real-time reward situations, but if an algorithm for goal-based situations is available, even better.

Solution

There are some research papers on the topic:

And some code:

Code examples for neural network reinforcement learning.

Those are just some of the top Google search results on the topic. The first couple of papers look pretty good, although I haven't read them personally. I think you'll find even more information on neural networks with reinforcement learning if you do a quick search on Google Scholar.

OTHER TIPS

If the output that led to a reward r is backpropagated into the network r times, the network is reinforced in proportion to the reward. This is not directly applicable to negative rewards, but I can think of two solutions that will produce different effects:

1) If you have a set of rewards in a range from rmin to rmax, shift them to the range 0 to (rmax - rmin) by subtracting rmin, so that they are all non-negative. The bigger the reward, the stronger the reinforcement that is created.

2) For a negative reward -r, backpropagate a random output r times, as long as it's different from the one that led to the negative reward. This not only reinforces desirable outputs, but also diffuses or avoids bad outputs.
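
To make the two options above concrete, here is a minimal sketch in plain Python. It assumes integer rewards and a discrete set of outputs, and it uses a single linear softmax layer as a stand-in for the ANN. All the names (`shift_rewards`, `targets_for`, `supervised_step`, `reinforce`) are made up for illustration; swap in your own network and backpropagation step.

```python
import math
import random


def shift_rewards(rewards):
    """Option 1: shift rewards from [rmin, rmax] to [0, rmax - rmin]."""
    r_min = min(rewards)
    return [r - r_min for r in rewards]


def targets_for(action, reward, n_actions, rng):
    """Return the output indices to backpropagate as training targets.

    Assumes an integer reward.
    Reward r >= 0: the chosen action, repeated r times.
    Reward -r (option 2): r random actions other than the chosen one.
    """
    if reward >= 0:
        return [action] * reward
    others = [a for a in range(n_actions) if a != action]
    return [rng.choice(others) for _ in range(-reward)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Output probabilities of the one-layer stand-in network."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def supervised_step(weights, x, target, lr=0.1):
    """One cross-entropy gradient step towards `target` (updates in place)."""
    probs = predict(weights, x)
    for k, row in enumerate(weights):
        err = probs[k] - (1.0 if k == target else 0.0)
        for i in range(len(row)):
            row[i] -= lr * err * x[i]


def reinforce(weights, x, action, reward, rng, lr=0.1):
    """Repeat the supervised step once per unit of reward."""
    for target in targets_for(action, reward, len(weights), rng):
        supervised_step(weights, x, target, lr)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, 0.5]                          # hypothetical state features
    weights = [[0.0, 0.0] for _ in range(3)]  # 3 possible outputs
    print("before:", predict(weights, x))
    reinforce(weights, x, action=1, reward=3, rng=rng)
    print("after reward +3 for output 1:", predict(weights, x))
```

Note that repeating the same step r times is roughly the same as taking one step with the learning rate scaled by r, which also works for non-integer rewards.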

Licensed under: CC-BY-SA with attribution