Q-learning why do we subtract the Q(s, a) term during update?
31-10-2019
Question
I can't understand the meaning of the $-Q(s_t, a_t)$ term in the Q-learning algorithm, and I can't find an explanation for it either.
Everything else makes sense. The q-learning algorithm is an off-policy algorithm, unlike SARSA. The Bellman equation gives the q-learning update as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_t + \gamma \cdot \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$
"The q value for action $a$ taken in state $s$ at time $t$ becomes equal to: that same q-value plus small amount of: currently received reward (reward could be zero) with some amount $\gamma$ of the best Q-value available in the new state, minus our current value
To me, this $-Q(s_t, a_t)$ term at the very end seems redundant. If we set gamma to $0.8$, the future rewards will decay anyway.

True, if we instead set $\gamma=0$, then the $-Q(s_t, a_t)$ term will drag our value down. Is there a case where it would be useful, and what would the result be?
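For concreteness, here is a minimal sketch of the update from the first equation as tabular Python code (the function name and array layout are my own, not from any particular library). It shows where the $-Q(s_t, a_t)$ term lives: it turns the bracket into a TD error, so we move the current value a fraction $\alpha$ *toward* the target rather than piling the whole target on top of it.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.8):
    """One tabular Q-learning update on a Q-table of shape (n_states, n_actions).

    The -Q[s, a] term makes the bracket a TD error: without it, we would
    keep adding the full target to Q[s, a] on every visit and the value
    would grow without bound, even with gamma < 1.
    """
    target = r + gamma * np.max(Q[s_next])  # r_t + gamma * max_a' Q(s_{t+1}, a')
    td_error = target - Q[s, a]             # the -Q(s_t, a_t) term lives here
    Q[s, a] += alpha * td_error             # move a fraction alpha toward the target
    return Q
```

With the subtraction in place, repeating this update on a fixed transition makes `Q[s, a]` converge to the target instead of diverging.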
Edit:
Wikipedia uses a slightly different form of the Bellman equation:
$$Q(s_t, a_t) \leftarrow (1-\alpha)\cdot Q(s_t, a_t) + \alpha\left[ r_t + \gamma \cdot \max_{a'} Q(s_{t+1}, a') \right] $$
It's the same equation as the one above: expanding $(1-\alpha) \cdot Q(s_t, a_t) = Q(s_t, a_t) - \alpha \cdot Q(s_t, a_t)$ and folding the $-\alpha \cdot Q(s_t, a_t)$ term into the bracket gives back the first equation.
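The equivalence of the two forms can be checked numerically (the values below are arbitrary, for illustration only):

```python
q = 0.5       # current Q(s_t, a_t)
target = 2.0  # r_t + gamma * max_a' Q(s_{t+1}, a')
alpha = 0.1

form1 = q + alpha * (target - q)           # first form: TD-error update
form2 = (1 - alpha) * q + alpha * target   # Wikipedia form: linear interpolation

assert abs(form1 - form2) < 1e-12          # identical up to floating-point error
```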
This representation makes me kind of understand that we are linearly interpolating from the current to the newer Q-value, but I can't tie it to the original representation. In the original representation (the first equation), it seems as if gamma alone would be enough. Can someone clear this up for me?