Question

In this Udacity project code, which I have been combing through line by line to understand the implementation, I have stumbled on a part of the Actor class that appears at line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py

# Define loss function using action value (Q value) gradients
action_gradients = layers.Input(shape=(self.action_size,))
loss = K.mean(-action_gradients * actions)

The above snippet seems to create an Input layer for the action gradients, which is used to build the loss that the Adam optimizer (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) minimizes in the next snippet. But where and how does anything get passed to this action_gradients layer? Here is the optimizer's get_updates call, where the above loss is used:

# Define optimizer and training function
optimizer = optimizers.Adam(lr=self.lr)
updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)

Next we get to train_fn, a K.function-type function (also in class Actor):

self.train_fn = K.function(
    inputs=[self.model.input, action_gradients, K.learning_phase()],
    outputs=[],
    updates=updates_op)

Here, action_gradients is the gradient of the Q-value w.r.t. the actions, taken from the local critic network (not the target critic network).
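
For intuition, here is a minimal standalone sketch (not from the project code; it assumes standalone Keras 2.x with the TensorFlow 1.x backend, as the Udacity code does) showing that a layers.Input tensor is only a symbolic placeholder and receives data at the moment the K.function built on top of it is called:

import numpy as np
from keras import layers, backend as K

x = layers.Input(shape=(3,))   # symbolic placeholder, holds no data yet
y = 2.0 * x                    # graph built on top of the placeholder

f = K.function(inputs=[x], outputs=[y])
print(f([np.ones((1, 3))]))    # data is bound only here, at call time

In the Actor, action_gradients plays the same role as x above: it is wired into the loss and the update ops symbolically, and it only receives concrete values when self.train_fn(...) is called.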

The following are the arguments when train_fn is called:

    """
    inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)], 
    (note: test mode = 0)
    outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.  
    update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
"""

So now I began to think that the formula for the deterministic policy gradient here, https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html, would be realized once the gradient of the Q-values w.r.t. the actions was passed from the critic to the actor.

[image: explanation of the DPG formula]
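
For reference, the deterministic policy gradient in that image (the DDPG actor update, as in Lillicrap et al.) is usually written as:

$$
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i}
  \nabla_{a} Q\bigl(s, a \mid \theta^{Q}\bigr)\Big|_{s = s_i,\, a = \mu(s_i)}
  \, \nabla_{\theta^{\mu}} \mu\bigl(s \mid \theta^{\mu}\bigr)\Big|_{s = s_i}
$$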

I think the action_gradients Input layer mentioned at the top is trying to obtain the gradient of the actor's output w.r.t. its parameters so that it can do the multiplication pictured in the image. However, to reiterate, how is anything passed to this layer, and why is the loss calculated this way?

Edit: I missed the comment on line 55:

# Define loss function using action value (Q value) gradients

So now I know that the action_gradients Input layer receives the action gradients from the critic.
Apparently this is a trick used by some implementations, such as OpenAI Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d
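
For comparison, here is a standalone sketch (assuming TensorFlow 1.x; the names and sizes are made up, and this is not code from Baselines or from the project) of a more explicit way some implementations express the same idea: dQ/da is fed into tf.gradients as the upstream gradient (grad_ys) and the result is applied directly, instead of going through a surrogate loss.

import tensorflow as tf

states = tf.placeholder(tf.float32, [None, 6])
dq_da = tf.placeholder(tf.float32, [None, 4])    # dQ/da fed from the critic
actions = tf.layers.dense(states, 4, activation=tf.nn.tanh)
actor_weights = tf.trainable_variables()

# Feed -dQ/da as the upstream gradient of the actions w.r.t. the actor weights
actor_grads = tf.gradients(actions, actor_weights, grad_ys=-dq_da)
train_op = tf.train.AdamOptimizer(1e-3).apply_gradients(
    zip(actor_grads, actor_weights))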

But still, why is the loss calculated as -action_gradients * actions ?


Licensed under: CC-BY-SA with attribution