Question

I'm implementing the DQN algorithm from scratch on the MountainCar environment. I'm using a reward of $1.0$ when the car reaches the flag, and $0$ otherwise. The discount factor is set to $\gamma = 0.99$. The algorithm starts with an exploration factor of $\epsilon = 1.0$ and decreases it over time to $\epsilon = 0.1$.
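Roughly, the reward and exploration schedule look like this (a simplified sketch; the linear decay and names like `decay_steps` are just placeholders for my actual schedule):

```python
EPS_START, EPS_END = 1.0, 0.1
GAMMA = 0.99  # discount factor

def reward(reached_flag):
    # 1.0 when the car hits the flag, 0 otherwise
    return 1.0 if reached_flag else 0.0

def epsilon(step, decay_steps=10_000):
    # decay from 1.0 down to 0.1 over training (schedule simplified here)
    frac = min(step / decay_steps, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)
```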

If I understood correctly, the $Q$ function for a given state-action pair is defined as:

$Q(s_t, a_t) = r_t + \gamma \max_{a} Q(s_{t+1}, a)$
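In the update step this corresponds to a target computed roughly like this (a minimal sketch; the function and variable names are just for illustration):

```python
import numpy as np

def td_target(r, q_next, gamma=0.99, done=False):
    # q_next: the network's Q-value estimates for all actions in s_{t+1}
    # no bootstrapping on terminal transitions (when the car reaches the flag)
    if done:
        return r
    return r + gamma * np.max(q_next)
```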

So, $Q_{max}$ would satisfy the condition:

$Q_{max} = r_{max} + \gamma \times Q_{max}$

Which means:

$Q_{max} = \frac{r_{max}}{1 - \gamma}$
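With my setup ($r_{max} = 1.0$, $\gamma = 0.99$) that gives $Q_{max} = \frac{1.0}{1 - 0.99} = 100$.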

However, since my network only approximates the $Q$ function, it can sometimes produce a value greater than $Q_{max}$. When that happens, further training causes the values to grow exponentially, and the whole thing blows up.

When I clamp the error between the expected value and the current predicted value to some small number, the blowup still happens, just a bit more slowly.
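What I mean by clamping the error is roughly this (a simplified sketch; the clip bound of $1.0$ is just an example value):

```python
import numpy as np

def clipped_update_target(q_pred, target, clip=1.0):
    # limit how far a single target can pull the current prediction
    td_error = np.clip(target - q_pred, -clip, clip)
    return q_pred + td_error  # the value the network is regressed towards
```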

The only solution I can think of is to clamp the predicted value $Q(s_{t+1}, a_{t+1})$ to $Q_{max}$, forcing it to never go above that. I have done that, and got OK results with it.
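Concretely, the clamping I ended up with looks roughly like this (a simplified sketch of the idea, not my exact code):

```python
import numpy as np

R_MAX, GAMMA = 1.0, 0.99
Q_MAX = R_MAX / (1.0 - GAMMA)  # = 100 with these settings

def clamped_td_target(r, q_next, done=False):
    # clamp the bootstrapped value so the target can never exceed Q_MAX
    bootstrap = min(float(np.max(q_next)), Q_MAX)
    return r if done else r + GAMMA * bootstrap
```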

Does this make sense? Is this a situation that happens in DQN? Or maybe I missed something and my implementation is a bit buggy?
