Question

Context

I am confused about how a DQN is supposed to solve the cart-pole problem, since the rewards are so dense. I have been working from the PyTorch DQN example. I am aware of some existing solutions, but my issue is with the basic reward principle of the environment.

Unlike the tutorial, I feed the network the raw 1x4 state returned by the environment rather than an image. I also converted the action output to a binned output: the 1x2 action becomes 3x2 when binning is set to 3, so instead of taking the max action row-wise, I take the max column-wise (see the sketch below). I am using fixed targets (training a primary and a target DQN).
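
To make the binning concrete, here is a minimal sketch of the network head I am describing; the class and variable names are illustrative, not my exact code:

```python
import torch
import torch.nn as nn

N_STATE, N_ACTIONS, N_BINS = 4, 2, 3  # CartPole: 1x4 state, 2 actions, 3 bins

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(N_STATE, 64), nn.ReLU(),
            nn.Linear(64, N_BINS * N_ACTIONS),
        )

    def forward(self, state):                 # state: (batch, 4)
        q = self.body(state)                  # (batch, 6)
        return q.view(-1, N_BINS, N_ACTIONS)  # (batch, 3, 2)

net = QNet()
q = net(torch.randn(1, N_STATE))
# Column-wise max: for each action column, pick its best bin.
best_bin_per_action = q.argmax(dim=1)         # (batch, 2)
```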

Question / Concern

My main issue with the environment is that, from the DQN's point of view, keeping the pole vertical looks no different from holding it near failure: both earn +1 per step. How does the DQN get better if it receives +1 regardless? My hypothesis is that keeping the pole tilted creates more samples in replay memory, so that when the model is optimized, the tilted-pole states end up with higher estimated value than the vertical-pole states, simply because they are over-represented. How can we expect a DQN to do well with this kind of reward setup? Wouldn't it be better if cart-pole produced +1 reward only when the pole is near vertical?
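
For reference, the fixed-target update I am using computes its TD target roughly as below (a minimal sketch adapted from the PyTorch tutorial; tensor names are illustrative, and it assumes a flat Q-value head rather than my binned one):

```python
import torch

GAMMA = 0.99

def td_target(reward, next_state, done, target_net):
    # Bootstrap from the frozen target network; terminal transitions get
    # no bootstrap term, so only there does the +1 reward stand alone.
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values  # (batch,)
    return reward + GAMMA * next_q * (1.0 - done.float())
```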

Extra Information

The goal here is to use cart-pole for debugging an RL model, then shift it to multi-joint robot control.

The state is normalized to the expected min/maxes:

Current State [[0.45564668 0.51196048 0.53126856 0.52450375]]

The action input is:

Raw Action tensor([[0.9477, 0.9471]]) Bin Action [1.0, 0.0]

I am just using the simplest action representation for the tests below (a sketch of this pre/post-processing follows). I have also tested double DQNs, dueling DQNs, and the use of PER, as well as dropping the state space down to 1x1 by inputting only the pole angle.
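
Here is a minimal sketch of that pre/post-processing; the bounds and helper names are my own assumptions, not taken from the env spec (CartPole's velocity terms are technically unbounded, so finite caps are imposed):

```python
import numpy as np
import torch

# Assumed per-dimension bounds: [position, velocity, angle, angular velocity].
# The +/-5.0 velocity caps are an assumption for normalization purposes.
LOW  = np.array([-4.8, -5.0, -0.418, -5.0])
HIGH = np.array([ 4.8,  5.0,  0.418,  5.0])

def normalize(state):
    """Map the raw 1x4 observation to [0, 1] per dimension."""
    return (state - LOW) / (HIGH - LOW)

def bin_action(raw_q):
    """One-hot the argmax of the raw network output,
    e.g. tensor([[0.9477, 0.9471]]) -> [1.0, 0.0]."""
    one_hot = torch.zeros_like(raw_q)
    one_hot[0, raw_q.argmax(dim=1)] = 1.0
    return one_hot.squeeze(0).tolist()
```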

[Plot: X axis is the number of steps during an episode; Y axis is the number of episodes.]
