Question

Edit: The question has been edited to better reflect what I learned after asking the original version.

I implemented the clipped-objective variant of PPO (PPO-Clip) as explained here: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Basically, I used a dummy actor network to get the new action probability without updating the actual actor network:

"""use temp_actor to get new prob so we don't update the actual actor until
        we do the clip op"""
        curr_weights = self.actor.get_weights()
        self.temp_actor.set_weights(curr_weights)
        self.temp_actor.fit(state, advantages, epochs=1, verbose=0)
        new_policy = self.temp_actor.predict(state, batch_size=1).flatten()
        new_aprob = new_policy[action]

Then I computed the ratio of action probabilities and implemented the PPO clipping part of the algorithm:

    # Probability ratio between the updated (temp) policy and the old policy
    ratio = new_aprob / old_aprob
    # scale = min(ratio * advantages, K.clip(ratio, 1 - self.epsilon, 1 + self.epsilon) * advantages)
    no_clip = ratio * advantages
    clipped = np.clip(ratio, 1 - self.epsilon, 1 + self.epsilon) * advantages

    # Train the actual actor on the smaller (clipped) of the two surrogate terms
    self.actor.fit(state, np.minimum(no_clip, clipped), epochs=1, verbose=0)

The full code is here (please excuse some coarse language in the comments): https://github.com/nyck33/openai_my_implements/blob/master/cartpole/my_ppo_cartpole.py
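
For comparison, the clipped surrogate is often written as a custom loss instead of being baked into the fit targets. Below is a minimal Keras sketch of that idea; ppo_clip_loss, old_prob and advantage are placeholder names I am introducing here, not anything from the repository above.

    import tensorflow.keras.backend as K

    def ppo_clip_loss(old_prob, advantage, epsilon=0.2):
        # old_prob and advantage are 1-D tensors (one value per sample) fed in
        # alongside the state, e.g. as extra model inputs; epsilon is the clip range.
        def loss(y_true, y_pred):
            # y_true: one-hot of the action taken, y_pred: current policy probabilities
            new_prob = K.sum(y_true * y_pred, axis=-1)
            ratio = new_prob / (old_prob + 1e-10)
            unclipped = ratio * advantage
            clipped = K.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
            # PPO maximizes min(unclipped, clipped); Keras minimizes, hence the minus sign
            return -K.mean(K.minimum(unclipped, clipped))
        return loss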

How can I adapt this for a continuous action space problem such as Pendulum-v0?
Update: I just read that the distribution over actions is a normal (Gaussian) distribution: Reddit normal distribution

That seemed strange to me, since I imagined the distribution over actions would be skewed one way or the other, depending on which action tendencies give better results in a particular state.
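
If I understand the Gaussian-policy idea correctly, the discrete action probability is replaced by the density of the sampled action under the old and new distributions, and the ratio is formed from those densities (usually via log-probabilities). A rough NumPy sketch with made-up numbers of what would replace new_aprob / old_aprob:

    import numpy as np

    def gaussian_log_prob(action, mu, sigma):
        # Log density of a 1-D Gaussian; stands in for the discrete action probability
        return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (action - mu) ** 2 / (2.0 * sigma ** 2)

    # mu and sigma would come from the old and new actor heads; these values are illustrative
    old_logp = gaussian_log_prob(action=0.3, mu=0.1, sigma=0.5)
    new_logp = gaussian_log_prob(action=0.3, mu=0.2, sigma=0.4)
    ratio = np.exp(new_logp - old_logp)  # plays the role of new_aprob / old_aprob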

Update: I found this on the Stable Baselines site for PPO:

"return probability distribution"

From this URI: ppo explanation

I also saw that Section 13.7 in Sutton's RL book seems to be a must-read for this type of problem, as it is titled "Policy Parameterization for Continuous Actions."
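
As I understand that section, the policy is parameterized by a mean and a standard deviation that are both functions of the state. A minimal Keras sketch of such an actor head for Pendulum-v0 (layer sizes and names are my own choices, not from my code above):

    from tensorflow.keras.layers import Dense, Input
    from tensorflow.keras.models import Model

    state_in = Input(shape=(3,))               # Pendulum-v0 observations are 3-dimensional
    h = Dense(64, activation="tanh")(state_in)
    mu = Dense(1, activation="tanh")(h)        # mean of the Gaussian (scaled to the action range later)
    log_sigma = Dense(1)(h)                    # log std; exponentiate when sampling an action
    actor = Model(state_in, [mu, log_sigma])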

Furthermore, while searching for a readable implementation, I learned that the GAE advantage estimates are normalized, which probably makes an implementation more robust against wild fluctuations. I also think my implementation is incomplete, i.e. missing other components compared to the PyTorch "solution" I linked to in the answer.
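
For reference, the normalization I mean is just rescaling each batch of advantage estimates to zero mean and unit variance before the policy update; a small sketch (the function name is mine):

    import numpy as np

    def normalize_advantages(advantages, eps=1e-8):
        # Zero-mean, unit-variance normalization of a batch of GAE estimates
        advantages = np.asarray(advantages, dtype=np.float32)
        return (advantages - advantages.mean()) / (advantages.std() + eps)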

