Question

It seems to me that the $V$ function can easily be expressed in terms of the $Q$ function, which would make the $V$ function superfluous. However, I'm new to reinforcement learning, so I guess I got something wrong.

Definitions

Q-learning and V-learning are defined in the context of Markov Decision Processes. An MDP is a 5-tuple $(S, A, P, R, \gamma)$, where

  • $S$ is a set of states (typically finite)
  • $A$ is a set of actions (typically finite)
  • $P(s, s', a) = P(s_{t+1} = s' | s_t = s, a_t = a)$ is the probability of transitioning from state $s$ to state $s'$ with action $a$.
  • $R(s, s', a) \in \mathbb{R}$ is the immediate reward after going from state $s$ to state $s'$ with action $a$. (It seems to me that usually only $s'$ matters).
  • $\gamma \in [0, 1]$ is called the discount factor and determines whether one focuses on immediate rewards ($\gamma = 0$), the total reward ($\gamma = 1$), or some trade-off.

A policy $\pi$, according to Reinforcement Learning: An Introduction by Sutton and Barto, is a function $\pi: S \rightarrow A$ (it could also be probabilistic).

According to Mario Martin's slides, the $V$ function is $$V^\pi(s) = E_\pi \{R_t | s_t = s\} = E_\pi \{\sum_{k=0}^\infty \gamma^k r_{t+k+1} | s_t = s\}$$ and the Q function is $$Q^\pi(s, a) = E_\pi \{R_t | s_t = s, a_t = a\} = E_\pi \{\sum_{k=0}^\infty \gamma^k r_{t+k+1} | s_t = s, a_t=a\}$$

My thoughts

The $V$ function states what the expected overall value (not reward!) of a state $s$ under the policy $\pi$ is.

The $Q$ function states what the value of a state $s$ and an action $a$ under the policy $\pi$ is.

This means, $$Q^\pi(s, \pi(s)) = V^\pi(s)$$

Right? So why do we have the value function at all? (I guess I mixed up something)


Solution

Q-values are a great way to make actions explicit, so you can deal with problems where the transition function is not available (model-free). However, when your action space is large things are not so nice, and Q-values are not so convenient. Think of a huge number of actions or even continuous action spaces.

From a sampling perspective, the dimensionality of $Q(s, a)$ is higher than that of $V(s)$, so it might be harder to get enough $(s, a)$ samples compared with $(s)$ samples. On the other hand, if you do have access to the transition function, $V$ alone can be enough, because you can recover the greedy action with a one-step lookahead (as sketched below).
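To make that concrete, here is a minimal sketch (not from the original answer; all sizes and arrays are made up) contrasting the two: acting greedily from Q-values needs no model, while acting greedily from $V$ alone requires the transition probabilities $P$ and rewards $R$ for a one-step lookahead.

```python
import numpy as np

# Toy sizes for illustration only.
n_states, n_actions = 4, 2
gamma = 0.9

rng = np.random.default_rng(0)
Q = rng.random((n_states, n_actions))          # some learned action values Q(s, a)
V = rng.random(n_states)                       # some learned state values V(s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.random((n_states, n_actions))          # expected immediate reward R(s, a)

s = 0

# With Q-values, acting greedily is model-free: just compare the actions.
best_action_from_Q = np.argmax(Q[s])

# With V alone you need the transition model P and reward R to do a
# one-step lookahead before you can pick an action.
lookahead = R[s] + gamma * P[s] @ V            # expected value of each action
best_action_from_V = np.argmax(lookahead)

print(best_action_from_Q, best_action_from_V)
```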

There are also other settings where both are combined, for instance the advantage function $A(s, a) = Q(s, a) - V(s)$. If you are interested, you can find a recent example using advantage functions here:

Dueling Network Architectures for Deep Reinforcement Learning

by Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot and Nando de Freitas.
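As a rough illustration of the idea behind that paper, here is a tiny sketch of the mean-subtracted aggregation of a state value and per-action advantages into Q-values; the numbers and variable names are made up.

```python
import numpy as np

# Hypothetical per-state outputs of the two streams of a dueling network:
# a scalar state value V(s) and one advantage per action A(s, a).
V_s = 1.5
A_s = np.array([0.2, -0.1, 0.4])

# Mean-subtracted aggregation (one of the schemes described in the paper):
# Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))
Q_s = V_s + (A_s - A_s.mean())

print(Q_s)         # [1.533..., 1.233..., 1.733...]
print(Q_s.mean())  # equals V_s, since the advantages are centred
```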

OTHER TIPS

$V^\pi(s)$ is the state-value function of an MDP (Markov Decision Process). It's the expected return starting from state $s$ and following policy $\pi$.

In the expression

$$V^\pi(s) = E_\pi \{G_t | s_t = s\} $$

$G_t$ is the total DISCOUNTED reward from time step $t$, as opposed to $R_t$, which is an immediate reward. Here you are taking the expectation over ALL actions according to the policy $\pi$.

$Q^\pi(s, a)$ is the action-value function. It is the expected return starting from state $s$, following policy $\pi$, taking action $a$. It focuses on a particular action in a particular state.

$$Q^\pi(s, a) = E_\pi \{G_t | s_t = s, a_t = a\}$$

The relationship between $Q^\pi$ and $V^\pi$ (the value of being in that state) is

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \, Q^\pi(s, a)$$

You sum every action-value multiplied by the probability to take that action (the policy $\pi(a|s)$).

If you think of the grid-world example, you multiply the probability of each move (up/down/left/right) by the value of taking that move, i.e. the one-step-ahead state value for that move. A small sketch of this weighting follows.
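Here is a small numeric sketch of that weighting for a single state; the Q-values and the equiprobable policy are made up for illustration.

```python
import numpy as np

actions = ["up", "down", "left", "right"]

# Hypothetical action values Q(s, a) for a single grid-world state s.
Q_s = np.array([1.0, -0.5, 0.3, 0.8])

# An equiprobable policy: pi(a|s) = 0.25 for every action.
pi_s = np.array([0.25, 0.25, 0.25, 0.25])

# V(s) = sum_a pi(a|s) * Q(s, a)
V_s = np.dot(pi_s, Q_s)
print(V_s)  # 0.4

# If the policy were greedy (deterministic), V(s) would collapse to max_a Q(s, a).
print(Q_s[np.argmax(Q_s)])  # 1.0
```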

You have it right: the $V$ function gives you the value of a state, and $Q$ gives you the value of an action in a state (following a given policy $\pi$). I found the clearest explanation of Q-learning and how it works in Tom Mitchell's book "Machine Learning" (1997), ch. 13, which is downloadable. $V$ is defined as the sum of an infinite series, but that's not important here. What matters is that the $Q$ function is defined as

$$ Q(s, a) = r(s, a) + \gamma V^{*}(\delta(s, a)) $$

where $V^*$ is the best value of a state, i.e. the value you would obtain by following the optimal policy, which you don't know. However, $V^*$ has a nice characterization in terms of $Q$:

$$ V^{*}(s) = \max_{a'} Q(s, a') $$

Computing $Q$ is done by replacing $V^*$ in the first equation, which gives

$$ Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a') $$
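As a sketch of how this recursion pins down $Q$, here is a tiny, made-up deterministic MDP (a 4-state chain with an absorbing goal) where the formula is simply iterated to a fixed point; `delta` and `r` below are illustrative stand-ins for Mitchell's transition and reward functions.

```python
import numpy as np

# A toy deterministic MDP (made up for illustration): 4 states in a chain,
# actions 0 = left, 1 = right.  State 3 is an absorbing goal state.
n_states, n_actions = 4, 2
gamma = 0.9

def delta(s, a):                       # deterministic transition function
    if s == 3:
        return 3                       # goal state loops on itself
    return max(s - 1, 0) if a == 0 else s + 1

def r(s, a):                           # reward 1 for stepping into the goal
    return 1.0 if s != 3 and delta(s, a) == 3 else 0.0

# Iterate Q(s, a) = r(s, a) + gamma * max_a' Q(delta(s, a), a') to a fixed point.
Q = np.zeros((n_states, n_actions))
for _ in range(100):
    Q = np.array([[r(s, a) + gamma * Q[delta(s, a)].max()
                   for a in range(n_actions)]
                  for s in range(n_states)])

V_star = Q.max(axis=1)                 # V*(s) = max_a Q(s, a)
print(np.round(Q, 3))
print(np.round(V_star, 3))             # [0.81, 0.9, 1.0, 0.0]
```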

This may seem an odd recursion at first, because it expresses the Q value of an action in the current state in terms of the best Q value of a successor state, but it makes sense when you look at how the backup process uses it: the exploration process stops when it reaches a goal state and collects the reward, which becomes that final transition's Q value. In a subsequent training episode, when the exploration process reaches the predecessor state, the backup process uses the above equality to update the current Q value of that predecessor state. The next time its own predecessor is visited, that state's Q value gets updated, and so on back down the line (Mitchell's book describes a more efficient way of doing this by storing all the computations and replaying them later). Provided every state is visited infinitely often, this process eventually computes the optimal Q.

Sometimes you will see a learning rate $\alpha$ applied to control how much Q actually gets updated: $$ Q(s, a) = (1-\alpha)Q(s, a) + \alpha \left(r(s, a) + \gamma \max_{a'} Q(s', a')\right) $$ $$ = Q(s, a) + \alpha \left(r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)\right) $$ Notice that the update to the Q value now depends on the current Q value. Mitchell's book also explains why that is and why you need $\alpha$: it's for stochastic MDPs. Without $\alpha$, every time a state-action pair was attempted there would be a different reward, so the estimated $\hat{Q}$ function would bounce all over the place and not converge. $\alpha$ is there so that new knowledge is only accepted in part. Initially $\alpha$ is set high so that the current (mostly random) values of Q are less influential. $\alpha$ is decreased as training progresses, so that new updates have less and less influence, and Q-learning converges.
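A minimal sketch of that update rule in code (the environment interaction is faked with a noisy reward, so you can see $\alpha$ smoothing out the stochasticity; all numbers are made up):

```python
import numpy as np

def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, following the update rule above:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage: the same (s, a) pair is seen repeatedly with noisy rewards.
rng = np.random.default_rng(1)
Q = np.zeros((4, 2))
for _ in range(500):
    reward = 1.0 + rng.normal(scale=0.5)   # stochastic reward around 1.0
    Q = q_learning_update(Q, s=0, a=1, reward=reward, s_next=3)

print(round(Q[0, 1], 2))   # hovers near 1.0 despite the noise, thanks to alpha
```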

Here is a more detailed explanation of the relationship between state value and action value in Aaron's answer. Let's first take a look at the definitions of the value function and the action-value function under policy $\pi$:

\begin{align} &v_{\pi}(s)=E{\left[G_t|S_t=s\right]} \\ &q_{\pi}(s,a)=E{\left[G_t|S_t=s, A_t=a\right]} \end{align}

where $G_t=\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$ is the return at time $t$. The relationship between these two value functions can be derived as

\begin{align} v_{\pi}(s)&=E{\left[G_t|S_t=s\right]} \nonumber \\ &=\sum_{g_t} p(g_t|S_t=s)g_t \nonumber \\ &= \sum_{g_t}\sum_{a}p(g_t, a|S_t=s)g_t \nonumber \\ &= \sum_{a}p(a|S_t=s)\sum_{g_t}p(g_t|S_t=s, A_t=a)g_t \nonumber \\ &= \sum_{a}p(a|S_t=s)E{\left[G_t|S_t=s, A_t=a\right]} \nonumber \\ &= \sum_{a}p(a|S_t=s)q_{\pi}(s,a) \end{align}

The above equation is important. It describes the relationship between two fundamental value functions in reinforcement learning, and it is valid for any policy. Moreover, if we have a deterministic policy, then $v_{\pi}(s)=q_{\pi}(s,\pi(s))$. Hope this is helpful for you. (For more about the Bellman optimality equation, see https://stats.stackexchange.com/questions/347268/proof-of-bellman-optimality-equation/370198#370198.)
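A quick numeric check of this identity, with made-up numbers, including the deterministic-policy special case that the original question was about:

```python
import numpy as np

# Hypothetical action values q_pi(s, a) for three actions in one state s.
Q_s = np.array([0.3, 1.2, -0.4])

# Stochastic policy: v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
pi_s = np.array([0.2, 0.5, 0.3])
print(np.dot(pi_s, Q_s))                  # 0.54

# Deterministic policy that always picks the second action:
# the sum collapses to a single term, v_pi(s) = q_pi(s, pi(s)),
# which is exactly the relation asked about in the question.
pi_det = np.array([0.0, 1.0, 0.0])
print(np.dot(pi_det, Q_s), Q_s[1])        # both 1.2
```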

The value function is an abstract formulation of utility, while the Q-function is what the Q-learning algorithm actually works with.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange