Question

There's a popular solution to the CartPole game using Keras and Deep Q-Learning: https://keon.github.io/deep-q-learning/

But there's a line of code that's confusing. The same question has been asked in the comments of that article and many people are confused, but there isn't a complete answer.

They are basically creating a main network but also a target network to try to approximate the Q function.

In this part of the code they are replaying from the buffer to train the target network:

# Sample a minibatch from the memory
minibatch = random.sample(self.memory, batch_size)

# Extract the information stored in each memory entry
for state, action, reward, next_state, done in minibatch:

    # if done, the target is simply the reward
    target = reward

    if not done:
        # otherwise, add the discounted future reward predicted for next_state
        target = reward + self.gamma * \
                 np.amax(self.model.predict(next_state)[0])

    # make the agent approximately map
    # the current state to the future discounted reward
    # We'll call that target_f
    target_f = self.model.predict(state)
    target_f[0][action] = target

    # Train the neural net with the state and target_f
    self.model.fit(state, target_f, epochs=1, verbose=0)

What I can't understand is this line:

target_f[0][action] = target

In terms of code, the predict function returns a NumPy array of arrays, like this one for example:

[[-0.2635497   0.03837822]]

Writing target_f[0] to access the first set of predictions is understandable, but why are they also indexing with [action]?

Thank you very much for the help!


Solution

Hi David and welcome to the community! The [0] is there to access the array of action values, since the prediction is wrapped in double brackets (a batch of size one). After that you need the action index to update the appropriate entry.
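To make the indexing concrete, here is a minimal sketch with made-up values (the array is just the example from the question, and action = 1 is assumed purely for illustration):

import numpy as np

# model.predict(state) returns a batch of predictions with shape (1, n_actions)
prediction = np.array([[-0.2635497, 0.03837822]])

q_values = prediction[0]       # [0] drops the batch dimension -> array([-0.2635497, 0.03837822])
action = 1                     # index of the action that was actually taken (assumed here)
q_selected = q_values[action]  # current Q estimate for that action: 0.03837822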

The Q network has as many outputs as there are available actions (in your case 2). You want to update the weights of the output layer that is responsible for estimating the Q value corresponding to the selected action. You have received a reward $r(s, action)$, so you assign it as the target for that output in order to use it in the MSE between the estimation $Q(s, action)$ and the observed return.
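In other words, for one sampled transition the fit call effectively minimizes $L = \big(Q(s, action) - target\big)^2$ with $target = r + \gamma \max_{a'} Q(s', a')$ (or simply $target = r$ when the episode is done), matching the code above; every other output entry has its target set to its own prediction and so contributes zero error.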

The weights of the other outputs cannot be updated using the collected $r(s,action)$, because the Q-learning update equation would be wrong for them, so their target remains the same as the prediction (resulting in an MSE of 0). Think of each of the network's outputs as being responsible for estimating the return of performing a particular action in a particular state. For example, the first output is responsible for estimating the return (given the input state) for action = left ($Q(s,left)$), and the other for action = right ($Q(s,right)$). You train each network head (output layer) with the MSE between the real reward and the estimation. Each time, the reward sampled from the environment is the result of one selected action, so you update only the corresponding head by assigning the reward sample as its target.
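As a rough numeric sketch of the whole trick (the reward, gamma and next-state values below are made up, not taken from the article):

import numpy as np

gamma = 0.95
reward = 1.0
action = 1
done = False

# Current estimates for the sampled state, shape (1, 2): one Q value per action.
target_f = np.array([[-0.2635497, 0.03837822]])

# Stand-in for self.model.predict(next_state).
next_q = np.array([[0.10, 0.25]])

# Q-learning target for the action that was actually taken.
target = reward if done else reward + gamma * np.amax(next_q[0])  # 1.0 + 0.95 * 0.25 = 1.2375

# Only the chosen action's entry is overwritten; the other entry stays equal to
# the prediction, so its squared error (and therefore its gradient) is zero.
target_f[0][action] = target
# target_f is now [[-0.2635497, 1.2375]] -> only the head for action 1 gets a learning signal.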

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange