Question

I'm trying to gain an intuitive understanding of deep reinforcement learning. In deep Q-networks (DQN) we store all of the actions/states/rewards in a memory array and, at the end of the episode, "replay" them through our neural network. This makes sense to me because we are trying to build out our rewards matrix: we see whether the episode ended in a reward and, if it did, scale that reward back through the matrix.
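For concreteness, here is a minimal sketch of the kind of memory array I mean (the class and method names are just my own illustration, not code from the paper):

```python
from collections import deque

class ReplayMemory:
    """Fixed-size buffer holding (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # Oldest transitions are dropped once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Transitions are appended in the order they were experienced.
        self.buffer.append((state, action, reward, next_state, done))

    def __len__(self):
        return len(self.buffer)
```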

I would think the sequence of actions that led to the reward state is what is important to capture - it is this sequence of actions (and not the actions taken independently) that led us to our reward state.

In the Atari DQN paper by Mnih et al., and in many tutorials since, we see the practice of sampling randomly from the memory array and training on those samples. So if we have a memory of:

$(\text{action } a, \text{state } 1) \rightarrow (\text{action } b, \text{state } 2) \rightarrow (\text{action } c, \text{state } 3) \rightarrow (\text{action } d, \text{state } 4) \rightarrow \text{reward!}$

We may train a mini-batch of:

[(action c, state 3), (action b, state 2), reward!]
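In code, the sampling step looks roughly like this (a sketch, assuming `memory` is the buffer from the sketch above and already holds enough transitions):

```python
import random

batch_size = 32

# Draw transitions uniformly at random, ignoring the order in which
# they were experienced.
batch = random.sample(memory.buffer, batch_size)

# Each transition is then treated as an independent training example.
states, actions, rewards, next_states, dones = zip(*batch)
```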

The reason given is:

Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.

or, from this PyTorch tutorial:

By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.

My intuition would tell me the sequence is what is most important in reinforcement learning. Most episodes have a delayed reward, so most action/state pairs receive no immediate reward (and are not "reinforced"). The only way to bring a portion of the reward back to these earlier states is to retroactively spread the reward across the sequence, through the future-reward term in the Q-learning update (roughly, reward + discount_factor * max future Q-value).
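Writing out the tabular Q-learning update I am referring to, with learning rate $\alpha$ and discount factor $\gamma$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

Each update touches a single transition $(s_t, a_t, r_t, s_{t+1})$ and borrows the remaining future reward from the current estimate $\max_{a'} Q(s_{t+1}, a')$, which is how the reward gets spread backwards over many updates.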

Randomly sampling from the memory bank breaks up our sequence; how does that help when we are trying to back-fill a Q (reward) matrix?

Perhaps this is more like a Markov model, where every state should be considered independently? Where is the error in my intuition?
