Question

As you all know, DQN and DDQN are known for unstable training.

Let's use the well-known CartPole environment. The agent has to balance the stick and gets a reward of +1 per frame. You can reach the 195 threshold with CartPole-v0, but the results will vary a lot. You will have a hard time getting this to work until it is "nearly stable". Possible culprits are the learning rate, the batch size and so on...

If you master v0, switch to CartPole-v1 and I'm sure your "stable" system will fail again. You normally have to adapt the parameters to make it work again (just my experience).

But there is something in the workflow of the algorithm that I don't understand:

for ep in range(num_episodes):

    state = env.reset()
    total = 0.0
    done  = False

    while not done:

        action = agent.get_action(state)                          # epsilon-greedy action from the current network
        next_state, reward, done, info = env.step(action)
        agent.remember(state, action, reward, next_state, done)   # store the transition in the replay buffer

        agent.train()                                              # batch update: the weights change on every step

        total += reward
        state = next_state

    ep_rewards.append(total)

You have all seen this workflow before, so what's the problem here?

1. We measure performance while we are training and moving the weights

Every agent.train() call does BATCH TRAINING and changes the weights. The "total" reward is calculated with a lot of different models, so which one are we measuring?

2. In the CartPole example the episode ends (done) when the agent fails to balance the pole

Some runs (especially the first ones) are very short, which leads to less training (an inconsistent number of update steps per episode). That means if the agent performs badly, it trains less. If it works well, it loops a lot until done, trains a lot, and moves its weights away from the good policy, which can make it unstable.

3. If we save a model - which model are we really saving?

We effectively test a bunch of models and get some kind of average performance. But what happens if we save the model after a good run? We can have a good run (high total reward) but then save a bad model (the last one). Which weights are we really saving? I can't explain that in a way that makes sense, can you?

Now a simple improvement that solves all of these problems just by moving some code around:

for ep in range(num_episodes):

    state = env.reset()
    total = 0.0
    done  = False

    while not done:

        action = agent.get_action(state)
        next_state, reward, done, info = env.step(action)
        agent.remember(state, action, reward, next_state, done)

        total += reward
        state = next_state

    # the total result comes from a fixed model
    # correct performance measures and saving ONE model are now possible
    ep_rewards.append(total)

    # train outside of the while-not-done loop
    # every episode now has a constant number of train steps, 50 for example
    for i in range(50):
        agent.train()

After this change I get really stable performance at maximum values, as you can see here:

Run 7 | Episode: 770 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 780 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 790 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 800 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 810 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 820 | eps 0.0 | total: 432.00 | ddqn True
Run 7 | Episode: 830 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 840 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 850 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 860 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 870 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 880 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 890 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 900 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 910 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 920 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 930 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 940 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 950 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 960 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 970 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 980 | eps 0.0 | total: 500.00 | ddqn True
Run 7 | Episode: 990 | eps 0.0 | total: 500.00 | ddqn True

Are my concerns valid?

Is this a valid improvement, or am I missing something here?


Solution

The "total" reward is calculated with a lot of different models, so which one are we measuring?

In principle, the total reward should be calculated by freezing the policy (equivalent to freezing the parameters of a parametric policy) and then averaging multiple rollouts on the environment, the usual Monte Carlo estimate.
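A minimal sketch of such an evaluation, assuming the same agent/env interface as in your code and that the agent's exploration (epsilon) has already been set to 0, so get_action is greedy:

import numpy as np

def evaluate(agent, env, n_episodes=10):
    # Monte Carlo estimate of the frozen policy's performance:
    # no agent.train() call and no exploration inside this loop
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        total = 0.0
        while not done:
            action = agent.get_action(state)      # greedy action from the frozen weights
            state, reward, done, info = env.step(action)
            total += reward
        returns.append(total)
    return np.mean(returns), np.std(returns)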

If it works well, it loops a lot until done, trains a lot, and moves its weights away from the good policy, which can make it unstable.

One of the reasons for having a replay buffer in off-policy learning is precisely to prevent this kind of catastrophic forgetting in the neural network that parametrizes the policy under dataset shift (the shift in the distribution of observed states, actions and rewards).
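For reference, a minimal replay buffer of the kind meant here (just a sketch; the capacity and batch size are arbitrary example values, not taken from your setup):

import random
from collections import deque

class ReplayBuffer:

    def __init__(self, capacity=50000):
        # fixed-size FIFO memory of past transitions
        self.memory = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform sampling mixes old and recent experience,
        # which softens the shift in the training distribution between updates
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)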

If we save a model - which model are we really saving?

You would probably want to compute the rewards as mentioned earlier (this becomes a sort of validation set) and pick the model that gives you the best mean reward (and possibly also the lowest variance). This is often expensive; a cheap proxy is your first version of the algorithm, which is biased in its reward computation, but in theory that bias shrinks to zero as the policy converges.
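In code, that model selection could look roughly like this, reusing the evaluate() sketch from above; agent.save() is a hypothetical method standing in for whatever serialization your agent provides:

best_mean = float("-inf")

for ep in range(num_episodes):

    state = env.reset()
    done  = False

    while not done:
        action = agent.get_action(state)
        next_state, reward, done, info = env.step(action)
        agent.remember(state, action, reward, next_state, done)
        state = next_state

    for i in range(50):                            # fixed training budget, as in your second loop
        agent.train()

    if ep % 20 == 0:                               # evaluate the frozen policy periodically
        mean_ret, std_ret = evaluate(agent, env, n_episodes=10)
        if mean_ret > best_mean:
            best_mean = mean_ret
            agent.save("best_model")               # hypothetical save method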

The change you've made to the data collection and learning loop is a valid one. However, it is usually not done this way because it is, in principle, sample inefficient: you have to wait for a full trajectory to complete before incorporating that information into the model. A more attractive approach, and the one usually followed in the literature, is to train every few steps instead.
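A rough sketch of that middle ground, again using your interface (train_every = 4 is just an illustrative value):

train_every = 4                                    # one batch update every few environment steps
step_count  = 0

for ep in range(num_episodes):

    state = env.reset()
    total = 0.0
    done  = False

    while not done:
        action = agent.get_action(state)
        next_state, reward, done, info = env.step(action)
        agent.remember(state, action, reward, next_state, done)

        total += reward
        state = next_state

        step_count += 1
        if step_count % train_every == 0:          # decouple update frequency from episode length
            agent.train()

    ep_rewards.append(total)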
