Question

I am trying to understand reinforcement learning and Markov decision processes (MDPs) in the case where a neural net is being used as the function approximator.

I'm having difficulty understanding the relationship between the MDP, where the environment is explored probabilistically, how this maps back to the learning parameters, and how the final solution/policies are found.

Am I correct in assuming that, in the case of Q-learning, the neural network essentially acts as a function approximator for the Q-value itself, so many steps into the future? How does this map to updating the parameters via backpropagation or other methods?

Also, once the network has learned how to predict future reward, how does this fit into the system in terms of actually making decisions? I am assuming that the final system would not make state transitions probabilistically.

Thanks


Solution

In Q-Learning, on every step you will use observations and rewards to update your Q-value function:

$$ Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha [R_{t+1}+ \gamma \underset{a'}{\max} Q_t(s_{t+1},a') - Q_t(s_t, a_t)] $$
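
As a concrete sketch of that update rule, a minimal tabular version could look like the following (the state/action counts, `alpha`, and `gamma` are placeholder values for illustration):

```python
import numpy as np

# Placeholder sizes and hyperparameters, purely for illustration.
n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99

Q = np.zeros((n_states, n_actions))  # tabular Q-value estimates

def q_learning_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) towards the bootstrapped target."""
    td_target = r + gamma * np.max(Q[s_next])   # R_{t+1} + gamma * max_a' Q(s_{t+1}, a')
    td_error = td_target - Q[s, a]              # temporal-difference error
    Q[s, a] += alpha * td_error                 # becomes Q_{t+1}(s_t, a_t)
```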

You are correct in saying that the neural network is just a function approximator for the Q-value function.

In general, the approximation part is just a standard supervised learning problem: your network takes (s, a) as input and outputs the Q-value. As the Q-values are adjusted, you need to train the network on these new samples. You will still run into issues, however, because the samples are correlated and SGD suffers as a result.
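
As a sketch of that supervised view (PyTorch is used here purely for illustration; the architecture and dimensions are made up), each training step is an ordinary regression of the network output towards the adjusted Q-value targets:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions: a 4-dimensional state and 2 discrete actions.
state_dim, n_actions = 4, 2

# Network taking a (state, one-hot action) pair and returning a scalar,
# matching the (s, a) -> Q(s, a) formulation described above.
q_net = nn.Sequential(
    nn.Linear(state_dim + n_actions, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(sa_batch, q_targets):
    """Fit the network outputs to the bootstrapped Q-value targets."""
    predictions = q_net(sa_batch).squeeze(-1)
    loss = F.mse_loss(predictions, q_targets)
    optimizer.zero_grad()
    loss.backward()   # backpropagation computes the gradients
    optimizer.step()  # the optimiser adjusts the parameters
    return loss.item()
```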

If you are looking at the DQN paper, things are slightly different. In that case, what they do is store transitions in a buffer (experience replay). To teach the network, they sample tuples from the buffer and bootstrap using this information to obtain a new Q-value that is taught to the network. By "teaching", I mean adjusting the network parameters using stochastic gradient descent or your favourite optimisation approach. Because the samples are not taught in the order in which the policy collects them, they are decorrelated, and that helps the training.
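
A rough sketch of that replay loop, assuming transitions are stored as `(s, a, r, s_next, done)` tuples and (as in DQN) a network that maps a state to one Q-value per action; this illustrates the idea rather than reproducing the paper's exact implementation:

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

gamma, batch_size = 0.99, 32
state_dim, n_actions = 4, 2  # illustrative sizes

# Replay buffer: a bounded store of past transitions.
replay_buffer = deque(maxlen=100_000)

# DQN-style parametrisation: state in, one Q-value per action out.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def replay_step():
    """Sample a random mini-batch, bootstrap the targets, and fit the network."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)  # random order decorrelates samples
    states, actions, rewards, next_states, dones = zip(*batch)
    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.int64)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s_next = torch.as_tensor(next_states, dtype=torch.float32)
    done = torch.as_tensor(dones, dtype=torch.float32)

    with torch.no_grad():  # targets are treated as fixed regression labels
        targets = r + gamma * q_net(s_next).max(dim=1).values * (1.0 - done)

    predictions = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```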

Lastly, in order to make a decision in state $s$, you choose the action that provides the highest Q-value:

$$ a^*(s) = \underset{a}{\operatorname{arg\,max}} \, Q(s,a) $$

If your Q-value function has been learnt completely and the environment is stationary, it is fine to be greedy at this point. However, while learning, you are expected to explore. There are several approaches, with $\varepsilon$-greedy being one of the easiest and most common.
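
For completeness, a minimal $\varepsilon$-greedy rule on top of the learned Q-values might look like this (`q_net` is the hypothetical state-to-Q-values network from the sketch above); once learning is finished you can set $\varepsilon = 0$ and act purely greedily:

```python
import random
import torch

def select_action(state, q_net, n_actions, epsilon=0.1):
    """epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(n_actions)  # exploratory random action
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())    # a*(s) = argmax_a Q(s, a)
```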

Licensed under: CC-BY-SA with attribution