Question

I can't understand the meaning of the $-Q(s_t, a_t)$ term in the Q-learning algorithm, and I can't find an explanation for it either.

Everything else makes sense. Q-learning is an off-policy algorithm, unlike SARSA. The Q-learning update rule (based on the Bellman equation) is as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_t + \gamma \cdot \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$

"The q value for action $a$ taken in state $s$ at time $t$ becomes equal to: that same q-value plus small amount of: currently received reward (reward could be zero) with some amount $\gamma$ of the best Q-value available in the new state, minus our current value


To me, this $-Q(s_t, a_t)$ term at the very end seems redundant. If we set $\gamma$ to $0.8$, the future rewards will decay anyway.

Yes, if instead we set $\gamma=0$, then the $-Q(s_t, a_t)$ term will drag our value down. Is there a case when that would be useful, and what would the result be?
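
To make the $\gamma=0$ case concrete, here is a toy loop (with a made-up constant reward) showing that the update then reduces to $Q \leftarrow Q + \alpha(r - Q)$, i.e. the estimate is simply pulled toward the immediate reward:

```python
# Toy illustration of the gamma = 0 case with a made-up constant reward.
Q_sa = 10.0          # some initial estimate
alpha, r = 0.1, 2.0  # hypothetical step size and immediate reward

for _ in range(100):
    # with gamma = 0 the future term vanishes and only -Q(s_t, a_t) remains
    Q_sa += alpha * (r + 0.0 - Q_sa)

print(Q_sa)  # approaches 2.0, the immediate reward
```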


Edit:

Wikipedia uses a slightly different form of the update rule:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\cdot Q(s_t, a_t) + \alpha\left[ r_t + \gamma \cdot \max_{a'} Q(s_{t+1}, a') \right] $$

It's the same equation as the one above: expanding $(1-\alpha)\cdot Q(s_t, a_t)$ and moving the resulting $-\alpha \cdot Q(s_t, a_t)$ term inside the brackets recovers the first equation.
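
As a quick sanity check (with arbitrary numbers) that the two forms give the same update:

```python
# Quick check with arbitrary numbers that both forms compute the same value.
alpha, gamma = 0.1, 0.8
q_old, r, best_next = 5.0, 1.0, 3.0   # made-up current value, reward, max over next state

form1 = q_old + alpha * (r + gamma * best_next - q_old)
form2 = (1 - alpha) * q_old + alpha * (r + gamma * best_next)

print(form1, form2)  # identical: 4.84 4.84
```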

This representation makes me kind of understand that we are linearly interpolating from the current Q-value toward a newer one, but I can't tie it back to the original representation. In the original representation (the first equation), it seems as if $\gamma$ alone should be enough. Can someone clear this up for me?

