Question

In RL Course by David Silver - Lecture 7: Policy Gradient Methods, David explains what an advantage function is and how it is the difference between $Q(s, a)$ and $V(s)$.


Preliminaries, from this post:

First, recall that a policy $\pi$ is a mapping from each state $s$ and action $a$ to the probability $\pi(a \mid s)$ of taking action $a$ when in state $s$.

The state value function, $V^\pi(s)$, is the expected return when starting in state $s$ and following $\pi$ thereafter.

Similarly, the state-action value function, $Q^\pi(s, a)$, is the expected return when starting in state $s$, taking action $a$, and following policy $\pi$ thereafter.
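For completeness, the two value functions are related: $V^\pi$ is the average of $Q^\pi$ over the actions the policy chooses,

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^\pi(s, a)\big] = \sum_a \pi(a \mid s)\, Q^\pi(s, a).$$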

In my understanding, $V(s)$ is always larger than $Q(s, a)$, because the function $V$ includes the reward for the current state $s$, unlike $Q$. So, why is the advantage function defined as $A = Q - V$ rather than $A = V - Q$ (at 1:12:29 in the video)?
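For reference, the definition used in the lecture is

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$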

Actually, $V$ might not be larger than $Q$, because the current state $s$ might carry a negative reward. In such a case, how can we be certain what to subtract from what, so that our advantage is always positive?

$Q(s, a)$ returns the entire total reward that is ultimately expected after we pick action $a$. $V(s)$ is the same, just with the extra reward from the current state $s$ as well.

I don't see why a value of $Q - V$ would be useful. On the other hand, $V - Q$ would be useful, because it would tell us the reward we would get at $s_{t+1}$ if we took action $a$.
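To make the quantities in the question concrete, here is a minimal numerical sketch, assuming a single hypothetical state with two actions and made-up values for $\pi(a \mid s)$ and $Q(s, a)$ (none of these numbers come from the lecture):

```python
import numpy as np

# Hypothetical single state with two actions a0, a1.
# The policy probabilities and Q-values below are made up for illustration.
pi = np.array([0.7, 0.3])    # pi(a | s)
Q = np.array([1.0, -2.0])    # Q(s, a)

# V(s) is the expectation of Q(s, a) over the policy's action distribution.
V = float(np.dot(pi, Q))     # 0.7 * 1.0 + 0.3 * (-2.0) = 0.1

# Advantage as defined in the lecture: A(s, a) = Q(s, a) - V(s).
A = Q - V                    # [0.9, -2.1]

print("V(s)     =", V)
print("A(s, a)  =", A)
print("E_pi[A]  =", float(np.dot(pi, A)))   # 0, up to floating-point error
```

With these particular made-up numbers, the advantage is positive for one action and negative for the other.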

