Question

I'm having trouble finding a good reward function for the pendulum problem. The function I'm using is $-x^2 - 0.25\,\dot{x}^2$, the quadratic error from the upright position, where $x$ is the current position (angle) of the pendulum and $\dot{x}$ is the angular velocity.

It takes a lot of time with this function and sometimes doesn't work. Does anyone have other suggestions? I've been searching on Google but haven't found anything I could use.


Solution

You could use the same reward function that OpenAI's inverted pendulum environment uses:

$\text{reward} = -(\Delta_{2\pi}\theta)^2 - 0.1\,\dot{\theta}^2 - 0.001\,u^2$

where $\Delta_{2\pi}\theta$ is the difference between the current and desired angular position, taken modulo $2\pi$, and $u$ is the torque (the action of your RL agent). The optimum is a reward as close to zero as possible.
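As a reference, here is a minimal Python sketch of that reward, mirroring the cost computation used by OpenAI Gym's `Pendulum` environment (with $\theta = 0$ taken as the upright position):

```python
import numpy as np

def angle_normalize(theta):
    # Wrap the angle difference into [-pi, pi] (the "modulo 2*pi" distance from upright).
    return ((theta + np.pi) % (2 * np.pi)) - np.pi

def pendulum_reward(theta, theta_dot, u):
    # Quadratic penalty on angle error, angular velocity and applied torque;
    # the reward is the negative of this cost, so the best achievable value is 0.
    costs = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * u ** 2
    return -costs
```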

The idea here is that you have a control problem for which you can write down a quadratic 'energy' or cost function that tells you the cost of performing an action at every single time step. In this paper (p. 33, section 5.2) you can find a detailed description.

I have tested RL algorithms with this objective function and did not encounter any convergence problems in either MATLAB or Python. If you still have problems, let us know what kind of RL approach you implemented and how you encoded the location of the pendulum.

Hope it helps!

OTHER TIPS

In reinforcement learning, you should avoid scoring interim results based on heuristics. Unlike in supervised learning or a search algorithm, you are not trying to guide the behaviour, only to reward good results. For an inverted pendulum, a good result might simply be "has not fallen over so far". There is nothing inherently wrong with a cost function that expresses cost as the distance from an ideal, but you do have to take more care with the values used.

Assuming you are using discounting and a continuing (not episodic) approach, the reward can be 0 for "not falling over" and -1 for "it fell over", followed by a reset and continuation. You can detect a fall by checking whether the pendulum has reached some large angle from the vertical (e.g. 45 degrees or more).
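A minimal sketch of that sparse scheme (assuming `theta` is the angle from the vertical, already wrapped to $[-\pi, \pi]$; the 45-degree threshold is just an example cut-off):

```python
import numpy as np

FALL_THRESHOLD = np.pi / 4  # 45 degrees from vertical; example cut-off for "fell over"

def sparse_reward(theta):
    # theta is assumed to be the angle from upright, wrapped to [-pi, pi].
    # Reward is 0 while the pendulum stays up and -1 on the step where it falls;
    # the environment is then reset and the (continuing) task carries on.
    return -1.0 if abs(theta) > FALL_THRESHOLD else 0.0
```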

For an episodic approach, it is more natural to give +1 for "ok" and 0 for the terminal "fell over" state, although the 0/-1 scheme would also work. However, you want to avoid negative values for any state that is "ok", because that is effectively telling the agent to hurry up and end the episode. In your case, ending the episode is bad, so you don't want that.
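The episodic variant might look like this sketch, returning a `done` flag so the episode terminates on a fall (same assumed threshold and angle convention as above):

```python
import numpy as np

def episodic_step_reward(theta, fall_threshold=np.pi / 4):
    # +1 for every step the pendulum is still "ok"; 0 and terminate once it falls over.
    fell_over = abs(theta) > fall_threshold
    reward = 0.0 if fell_over else 1.0
    return reward, fell_over  # (reward, done)
```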

If you do want to reward "perfection" in your episodic approach, then your formula might work better if you added a positive offset, so that the agent has an incentive to continue the episode if possible. You should choose a value such that recoverable states are positive.
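As an illustration of that offset idea, applied to the quadratic cost from the question (the offset of 2.0 is an arbitrary placeholder, not a tuned value):

```python
def offset_quadratic_reward(x, xdot, offset=2.0):
    # Quadratic cost from the question (-x^2 - 0.25*xdot^2), shifted by a positive
    # constant so that recoverable states still yield a positive reward.
    # The value 2.0 is purely illustrative; choose it so every recoverable state is > 0.
    return offset - (x ** 2 + 0.25 * xdot ** 2)
```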


Note that the above analysis applies only to certain episode-based approaches. It depends critically on what you count as an episode, and whether the agent is able to take an action which ends the episode.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange