PPO, A2C for continuous action spaces, math and code
01-11-2019
Question
Edit: Question has been edited to better reflect what I learned after asking the original question.
I implemented the clipped-objective variant, PPO-clip, as explained here: https://spinningup.openai.com/en/latest/algorithms/ppo.html
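For reference, the clipped surrogate objective from that page is

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}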
Basically I used a dummy actor network (temp_actor) to find the new action probability without yet updating the actual actor network.
"""use temp_actor to get new prob so we don't update the actual actor until
we do the clip op"""
curr_weights = self.actor.get_weights()
self.temp_actor.set_weights(curr_weights)
self.temp_actor.fit(state, advantages, epochs=1, verbose=0)
new_policy = self.temp_actor.predict(state, batch_size=1).flatten()
new_aprob = new_policy[action]
Then I worked out the ratio of action probabilities and implemented the PPO clipping parts of the algorithm:
ratio = new_aprob / old_aprob  # pi_new(a|s) / pi_old(a|s)
no_clip = ratio * advantages  # unclipped surrogate
clipped = np.clip(ratio, 1 - self.epsilon, 1 + self.epsilon) * advantages  # clipped surrogate
self.actor.fit(state, np.minimum(no_clip, clipped), epochs=1, verbose=0)
The full code is here (please excuse some coarse language in the comments): https://github.com/nyck33/openai_my_implements/blob/master/cartpole/my_ppo_cartpole.py
How can I adapt it for a continuous action-space problem such as Pendulum-v0?
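From what I have gathered so far, the main change is that the ratio new_aprob / old_aprob becomes a ratio of the policy's densities at the sampled action, usually computed from log-probabilities to avoid numerical underflow. A minimal numpy sketch, where new_logp, old_logp, advantages, and epsilon are hypothetical placeholder names rather than variables from my code:

import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, epsilon=0.2):
    # exp(log pi_new - log pi_old) equals pi_new / pi_old
    ratio = np.exp(new_logp - old_logp)
    no_clip = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # negative sign because optimizers minimize
    return -np.mean(np.minimum(no_clip, clipped))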
Update: I just read that actions are sampled from a normal (Gaussian) distribution here: Reddit normal distribution
That seems strange, since I imagined the distribution of actions would be skewed one way or the other, depending on whether a particular state gets better results with certain action tendencies.
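From what I can tell, though, the Gaussian is only the conditional distribution for a single state: its mean (and often its standard deviation) are outputs of the actor network, so they shift from state to state even though each conditional distribution is symmetric. A minimal numpy sketch of sampling an action and computing its log-density, assuming mu and log_std are the actor's outputs for the current state (hypothetical names):

import numpy as np

def sample_action(mu, log_std):
    # mu and log_std are 1-D arrays over the action dimensions
    mu, log_std = np.asarray(mu), np.asarray(log_std)
    std = np.exp(log_std)
    # reparameterized sample from N(mu, std^2)
    action = mu + std * np.random.randn(*mu.shape)
    # log-density of the sampled action under N(mu, std^2)
    log_prob = -0.5 * (((action - mu) / std) ** 2
                       + 2.0 * log_std
                       + np.log(2.0 * np.pi))
    return action, np.sum(log_prob)  # joint log-prob over action dimensions

For Pendulum-v0 the sampled action would then be clipped to the environment's action range of [-2, 2].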
Update: I found an explanation of PPO on the Stable Baselines site: ppo explanation
I also saw that Section 13.7 in Sutton's RL book seems to be a must-read for this type of problem, as it is titled "Policy Parameterization for Continuous Actions."
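The parameterization in that section writes the policy as a normal density whose mean and standard deviation are functions of the state:

\pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}}\,\exp\left(-\frac{(a - \mu(s, \theta))^2}{2\,\sigma(s, \theta)^2}\right)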
Furthermore, while searching for a readable implementation, I learned that the GAE advantage estimates are normalized, which probably makes an implementation more robust against wild fluctuations (see the sketch below). I also think my implementation is incomplete, i.e., missing other components compared to the PyTorch "solution" I linked to in the answer.
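A minimal sketch of that normalization, standardizing the advantage estimates per batch (the 1e-8 term is a common guard against division by zero, my assumption rather than something from the linked code):

import numpy as np

def normalize_advantages(advantages):
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)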
No correct solution