How can I apply reinforcement learning to continuous action spaces?

https://stackoverflow.com/questions/7098625

24-12-2020

Question

I'm trying to get an agent to learn the mouse movements necessary to best perform some task in a reinforcement learning setting (i.e. the reward signal is the only feedback for learning).

I'm hoping to use the Q-learning technique, but while I've found a way to extend this method to continuous state spaces, I can't seem to figure out how to accommodate a problem with a continuous action space.

I could just force all mouse movement to be of a certain magnitude and in only a certain number of different directions, but any reasonable way of making the actions discrete would yield a huge action space. Since standard Q-learning requires the agent to evaluate all possible actions, such an approximation doesn't solve the problem in any practical sense.

Solution

The common way of dealing with this problem is to use actor-critic methods, which extend naturally to continuous action spaces. Basic Q-learning can diverge when combined with function approximation; however, if you still want to use it, you can try combining it with a self-organizing map, as done in "Applications of the self-organising map to reinforcement learning". The paper also contains further references you might find useful.
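To make the actor-critic idea a bit more concrete, here is a minimal sketch, not a reference implementation. It assumes a made-up one-step task (state s drawn from [-1, 1], reward -(a - 2s)^2), a Gaussian policy whose mean is linear in the state, and a linear critic; the function name, the task and all hyperparameters are hypothetical. The point is only that the actor samples a real-valued action directly, so nothing ever has to enumerate actions.

```python
import numpy as np

def train_actor_critic(steps=5000, sigma=0.3, alpha_actor=0.02,
                       alpha_critic=0.05, seed=0):
    """One-step actor-critic on a toy task (all names here are illustrative)."""
    rng = np.random.default_rng(seed)
    w = 0.0             # actor parameter: policy mean is mu(s) = w * s
    v = np.zeros(2)     # critic parameters: V(s) = v . [1, s^2]
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0)       # toy state
        mu = w * s
        a = rng.normal(mu, sigma)        # continuous action sampled from the policy
        r = -(a - 2.0 * s) ** 2          # toy reward, maximized by a = 2 * s
        phi = np.array([1.0, s * s])     # critic features
        # TD error. Episodes end after one step here, so there is no
        # bootstrap term; in general it is r + gamma * V(s_next) - V(s).
        delta = r - v @ phi
        v += alpha_critic * delta * phi  # critic update
        # Actor update: TD error times the gradient of log pi(a|s) w.r.t. w,
        # which for a Gaussian with fixed sigma is (a - mu) / sigma^2 * s.
        w += alpha_actor * delta * (a - mu) / sigma ** 2 * s
    return w, v

w, v = train_actor_critic()
print(w)  # should end up near the optimal value 2.0
```

In a multi-step task the same two updates apply, with the bootstrapped TD error from the comment replacing the one-step version.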

OTHER TIPS

Fast forward to 2015: researchers at DeepMind proposed a deep reinforcement learning actor-critic method for problems with both continuous state and continuous action spaces. It is based on a technique called deterministic policy gradient. See the paper "Continuous control with deep reinforcement learning" and some implementations.
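The deterministic policy gradient update can be sketched on a toy problem as follows. This is not the method from the paper: there are no neural networks or target networks, the critic is fitted by least squares on a growing replay buffer, and the task (one-step, reward -(a - 2s)^2) and all names are assumptions made for illustration. What it does show is the core step: the actor parameter moves along dQ/da evaluated at a = mu(s), times dmu/dw.

```python
import numpy as np

def q_features(s, a):
    # Critic features; the toy reward is exactly linear in these.
    return np.stack([s * s, s * a, a * a], axis=-1)

def train_dpg(iterations=50, batch=64, noise=0.3, lr=0.5, seed=0):
    """Deterministic policy gradient on a toy one-step task (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = 0.0                       # deterministic actor: mu(s) = w * s
    S, A, R = [], [], []          # replay buffer
    theta = np.zeros(3)
    for _ in range(iterations):
        s = rng.uniform(-1.0, 1.0, size=batch)
        a = w * s + noise * rng.normal(size=batch)   # exploration noise on top of mu(s)
        r = -(a - 2.0 * s) ** 2                      # toy reward
        S.append(s); A.append(a); R.append(r)
        # Critic: fit Q(s, a) = theta . features on everything seen so far.
        s_all, a_all, r_all = map(np.concatenate, (S, A, R))
        theta, *_ = np.linalg.lstsq(q_features(s_all, a_all), r_all, rcond=None)
        # Actor: ascend dQ/da at a = mu(s), chained with dmu/dw = s.
        a_mu = w * s
        dq_da = theta[1] * s + 2.0 * theta[2] * a_mu
        w += lr * np.mean(dq_da * s)
    return w, theta

w, theta = train_dpg()
print(w, theta)
```

Because the actor is updated through the critic's gradient with respect to the action, no maximization over actions is needed at any point.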

There are numerous ways to extend reinforcement learning to continuous actions. One way is to use actor-critic methods. Another way is to use policy gradient methods.

A rather extensive explanation of different methods can be found in the following paper, which is available online: Reinforcement Learning in Continuous State and Action Spaces (by Hado van Hasselt and Marco A. Wiering).

For what you're doing, I don't believe you need to work in a continuous action space. Although the physical mouse moves in a continuous space, internally the cursor only moves in discrete steps (usually whole pixels), so any precision finer than that is unlikely to affect your agent's performance. The state space is still quite large, but it is finite and discrete.
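To put a rough number on the pixel argument, the sketch below enumerates cursor displacements under the assumption that one action moves the cursor by at most max_step pixels along each axis. That cap, the function names and the example Q-function are all hypothetical. The greedy step is then a plain max over a finite list.

```python
from itertools import product

def pixel_actions(max_step):
    """All (dx, dy) displacements with |dx| <= max_step and |dy| <= max_step."""
    steps = range(-max_step, max_step + 1)
    return list(product(steps, steps))

def greedy_action(q_value, state, actions):
    """Pick the action with the highest Q-value by exhaustive comparison."""
    return max(actions, key=lambda a: q_value(state, a))

actions = pixel_actions(10)
print(len(actions))  # (2 * 10 + 1) ** 2 = 441 actions

# Hypothetical Q-function that prefers moving the cursor onto a target pixel.
target = (3, -2)
def toy_q(state, action):
    x, y = state[0] + action[0], state[1] + action[1]
    return -((x - target[0]) ** 2 + (y - target[1]) ** 2)

print(greedy_action(toy_q, (0, 0), actions))  # (3, -2)
```

The count grows quadratically with max_step, so whether exhaustive evaluation stays practical depends on how far a single action is allowed to move the cursor.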

I know this post is somewhat old, but in 2016 a variant of Q-learning for continuous action spaces was proposed as an alternative to actor-critic methods. It is called normalized advantage functions (NAF). Here's the paper: "Continuous Deep Q-Learning with Model-based Acceleration".
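The trick in NAF is to restrict the shape of Q so that its maximizing action is available in closed form: Q(s,a) = V(s) - 1/2 (a - mu(s))^T P(s) (a - mu(s)), with P = L L^T positive definite. The sketch below only shows that parameterization. In the actual method mu, L and V are outputs of a neural network for a given state; here they are fixed arrays, and the helper names are made up for illustration.

```python
import numpy as np

def build_lower_triangular(raw, dim):
    """Fill a lower-triangular matrix from a flat vector; exponentiate the
    diagonal so that P = L @ L.T is positive definite."""
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = raw
    L[np.diag_indices(dim)] = np.exp(np.diag(L))
    return L

def naf_q(action, mu, L, value):
    """Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu) with P = L L^T."""
    P = L @ L.T
    diff = action - mu
    return value - 0.5 * diff @ P @ diff

# Stand-ins for what a network would output for one particular state.
mu = np.array([0.3, -0.1])               # greedy action
L = build_lower_triangular(np.array([0.2, 0.5, -0.4]), dim=2)
value = 1.7                              # state value V(s)

print(naf_q(mu, mu, L, value))                      # equals V(s)
print(naf_q(np.array([1.0, 1.0]), mu, L, value))    # strictly smaller
```

Since the quadratic term is never positive, argmax over a is simply mu(s), and max over a of Q(s,a) is V(s), which is what makes the usual Q-learning target cheap to compute.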

Another paper for the list, from the value-based school, is "Input Convex Neural Networks". The idea is to require Q(s,a) to be convex in the actions (not necessarily in the states). Solving the argmax Q inference then reduces to finding a global optimum by exploiting convexity, which is much faster than an exhaustive sweep and easier to implement than other value-based approaches. This likely comes at the expense of less representational power than the usual feedforward or convolutional neural networks.
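A rough sketch of the convexity idea, under several assumptions of my own: the network below models -Q(s,a) (so that maximizing Q is a convex minimization), has a single hidden layer with random, untrained weights, adds a quadratic term in the action to keep the toy problem well conditioned, and restricts actions to the box [-1, 1]. The class and method names are hypothetical, and this is not the architecture or the optimizer from the paper. Convexity in the action comes from applying a convex activation (softplus) to an affine function of a and combining the results with non-negative weights; the state enters without any constraint.

```python
import numpy as np

class TinyICNN:
    """Toy network whose output is convex in the action (illustrative only)."""

    def __init__(self, state_dim, action_dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.Wa = rng.normal(size=(hidden, action_dim))  # action weights
        self.Ws = rng.normal(size=(hidden, state_dim))   # state weights, unconstrained
        self.b = rng.normal(size=hidden)
        # Output weights must be non-negative to preserve convexity in a.
        self.wz = np.abs(rng.normal(size=hidden)) / hidden

    def neg_q(self, s, a):
        pre = self.Wa @ a + self.Ws @ s + self.b
        # Softplus of an affine map is convex; the quadratic term is an
        # extra assumption that makes the minimizer unique.
        return self.wz @ np.logaddexp(0.0, pre) + 0.5 * a @ a

    def grad_a(self, s, a):
        pre = self.Wa @ a + self.Ws @ s + self.b
        sigmoid = 1.0 / (1.0 + np.exp(-pre))   # derivative of softplus
        return self.Wa.T @ (self.wz * sigmoid) + a

    def best_action(self, s, steps=200, lr=0.2):
        """Projected gradient descent on -Q over the box [-1, 1]^d."""
        a = np.zeros(self.Wa.shape[1])
        for _ in range(steps):
            a = np.clip(a - lr * self.grad_a(s, a), -1.0, 1.0)
        return a

net = TinyICNN(state_dim=3, action_dim=2)
state = np.array([0.1, -0.4, 0.7])
a_star = net.best_action(state)
print(a_star, -net.neg_q(state, a_star))  # greedy action and its Q-value
```

Because the objective is convex in a, a local search like this cannot get stuck in a poor local optimum, which is the property the answer above relies on.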

Licensed under: CC-BY-SA with attribution
