Question

From what I understand, an MDP $(G, A, P, R)$ (Markov decision process) is represented as (see the sketch after this list):

  • A complete directed graph $G=(V, E)$
  • A set of actions $A_u$ for each vertex $u \in V$
  • A reward function $R$ that maps each vertex to a reward, i.e., $R \colon V \to \mathbb{R}$.
  • A probability function $P$ that gives the probability $P_a(u, v)\in [0, 1]$ of taking edge $(u, v)$ after performing action $a \in A_u$ at node $u$

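Concretely, I picture this structure roughly as the following Python sketch (the names `MDP`, `transition`, etc. are just my own, not from any library):

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Set, Tuple

Vertex = Hashable
Action = Hashable

@dataclass
class MDP:
    vertices: Set[Vertex]                    # V
    actions: Dict[Vertex, Set[Action]]       # A_u for each u in V
    reward: Dict[Vertex, float]              # R : V -> R
    # transition[(u, a)][v] = P_a(u, v), the probability of taking
    # edge (u, v) after performing action a at node u
    transition: Dict[Tuple[Vertex, Action], Dict[Vertex, float]]
```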
The behavior of an MDP is described in terms of some policy $F$, where $F_0$ is the initial state the policy starts in, and $F(u)\in A_u$ is the action $F$ takes at node $u$, defined for all $u \in V$.

We then say that the MDP starts at $u=F_0$, performs the action $a=F(u)$, then moves to some vertex $v \in V$ with probability $P_a(u, v)$. It repeats this process: on turn $t$ it performs an action $b=F(w)$ at its current node $w \in V$, then moves to some node $z\in V$ with probability $P_b(w, z)$.
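Roughly, I imagine simulating this as something like the sketch below (reusing the `MDP` sketch above; `run_policy` is just a made-up helper, and the policy is represented here as a dict mapping each vertex to an action):

```python
import random

def run_policy(mdp, F, F0, T):
    """Simulate T steps of the deterministic, memoryless policy F starting at F0."""
    u = F0                                  # start at u = F_0
    total_reward = mdp.reward[u]
    for _ in range(T):
        a = F[u]                            # action a = F(u), with a in A_u
        dist = mdp.transition[(u, a)]       # the distribution P_a(u, .)
        successors = list(dist)
        weights = [dist[v] for v in successors]
        u = random.choices(successors, weights=weights)[0]  # move with prob P_a(u, v)
        total_reward += mdp.reward[u]
    return total_reward
```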

Now, a POMDP (partially observable Markov decision process) is defined in a similar manner, except that at each state $u \in V$, instead of knowing its current state, $F$ is given some observation $O(u) \in S$, where $S$ is any set. $F$ then constructs a "belief state" $b \in [0, 1]^{\vert V \vert}$ from that information: a vector assigning to each node the probability with which $F$ "believes" it is currently at that node.

Now we define $F$ in terms of what action it performs in a given belief state, where the belief state can change with time.
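As far as I know, the belief state is typically maintained with a Bayesian-style update after each step; here is a rough sketch in the same notation, where the deterministic observation map $O$ means the indicator $[O(v) = o]$ stands in for the usual observation likelihood (`update_belief` is a made-up helper):

```python
def update_belief(mdp, O, b, a, o):
    """One belief update after performing action a and observing o.

    b is a dict {vertex: probability}; O is a dict giving O(v) for each vertex.
    """
    new_b = {}
    for v in mdp.vertices:
        if O[v] != o:
            new_b[v] = 0.0                  # observation rules this node out
        else:
            # predicted probability of landing in v: sum over u of P_a(u, v) * b(u)
            new_b[v] = sum(mdp.transition.get((u, a), {}).get(v, 0.0) * b[u]
                           for u in mdp.vertices)
    total = sum(new_b.values())
    # renormalise so the belief sums to 1 (if anything is consistent with o)
    return {v: p / total for v, p in new_b.items()} if total > 0 else new_b
```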

This all makes sense; my question is just that, since there is a different set of actions for each node, how does $F$ know which actions $A_u$ are available to it without knowing which node $u\in V$ it is in?
