Question

So I've been reading about Q-learning and neural networks. I believe I have the right idea, but I would like a second opinion on my code for the NN and on how I update it with Q-values.

I have created a MATLAB implementation of the Mountain Car problem and of my neural net; I am using the Neural Network Toolbox for the NN part.

It is a network with 2 inputs, 5-20 hidden units (for experimentation) and 3 outputs (corresponding to the actions in Mountain Car).

The hidden units use tansig, the output layer uses purelin, and the training function is traingdm.

Are these the right steps?

  1. Obtain the initial state s -> [-0.5; 0.0]
  2. Run the network with Qs=net(s) ... this gives me a 1x3 matrix of Q-values, one for each action in the initial state s.
  3. Select an action a using e-greedy selection.
  4. Simulate the mountain car and obtain s' (the new state that results from executing action a).
  5. Run the network with Qs_prime=net(s') to get another matrix of Q-values for s'.

Now here is where I am not sure whether this is correct, since I have to figure out how to update the weights of the NN properly.

  1. Compute QTarget = reward + gamma * (max Q-value from s')?
  2. Create a Targets matrix (1x3) with the Q-values from the initial s, and change the Q-value for the executed action a to QTarget.
  3. Use net=train(net,s,Targets) to update the weights in the NN.
  4. s=s'
  5. Repeat all of the above for the new s.

Example:

       actions
        1       2      3
Qs  = 1.3346 -1.9000 0.2371

Selected action 3 (corresponding to moving the mountain car forward).

Qs' = 1.3328 -1.8997 0.2463

QTarget=reward+gamma*max(Qs') = -1+1.0*1.3328 = 0.3328

s = [-0.5; 0.0] and Targets = 1.3346 -1.9000 0.3328

Or do I have this wrong, and the Targets should be 0 0 0.3328, since we don't know how good the other actions are?

Here is my MATLAB code (I use R2011 and the Neural Network Toolbox):

%create a neural network
num_hidden=5;
num_actions=3;
net= newff([-1.2 0.6; -0.07 0.07;], [num_hidden,num_actions], {'tansig', 'purelin'},'traingdm');

%network weight and bias initialization
net= init(net);

%turn off the training window
net.trainParam.showWindow = false;

%neural network training parameters
net.trainParam.lr=0.01;
net.trainParam.mc=0.1;
net.trainParam.epochs=100;

%parameters for q learning
epsilon=0.9;
gamma=1.0;


%parameters for Mountain car task
maxEpisodes =10;
maxSteps=5000;
reset=false;
inital_pos=-0.5;
inital_vel=0.0;

%construct the initial state
s=[inital_pos;inital_vel];
Qs=zeros(1,3);
Qs_prime=zeros(1,3);

%training for maxEpisodes
for i=1:maxEpisodes
 %each episode is maxSteps long
 for j = 1:maxSteps

    %run the network and get Q values for current state Qs-> vector of
    %current Q values for state s at time t Q(s_t)
    Qs=net(s);


    %select an action
    if (rand() <= epsilon)
        %returns max Q value over all actions
        [Qs_value a]=max(Qs);
    else
        %return a random integer between 1 and 3 (inclusive)
        a = randi(3);
    end

    %simulate a step of Mountain Car
    [s_prime, action, reward, reset] = SimulateMC(s,a);

    %get new Q values for S_prime -> Q(s_t+1)
    Qs_prime=net(s_prime);

    %Compute Qtarget for weight updates given by r+y*max Q(s_t+1) over all
    %actions
    Q_target = reward+gamma*max(Qs_prime);

    %Create a Targets matrix with the original state s Q-values
    Targets=Qs;

    %change q-value of the original action to the QTarget
    Targets(a)=Q_target;


    % update the network for input state s and targets
    [net TR]=train(net,s,Targets);
    %update the state for next step
    s=s_prime;
    %display the car's position to the user; the NN has learned if this output reaches -0.45
    disp(s(1))

    if reset==true
        bestSteps=j
        break
    end
 end
 %reset for new episode
 reset=false;
 s=[inital_pos;inital_vel];
end

%test the network
%reset state
 s=[inital_pos;inital_vel];
 for i=1:maxEpisodes
    for j=1:maxSteps
        %run the network and get Q values for current state
        Qs=net(s);

        %always select the action with the maximum Q-value
         [Qs_value a]=max(Qs);

        %simulate a step of Mountain Car
        [s_prime, action, reward, reset] = SimulateMC(s,a);

        s=s_prime;
        disp(s(1))
    end
     s=[inital_pos;inital_vel];
 end

Thanks

Solution

Problem representation

Using a neural network to represent the action-value function is a good idea; this has been shown to work well for a number of applications. However, a more natural representation of the Q-function would be a net that receives the combined state-action vector as input and has a scalar output. But as long as the number of actions is finite and small, it should be possible to do it the way you did. Just remember that, strictly speaking, you are not learning Q(s,a) but multiple value functions V(s) (one per action) that share all weights except those of the last layer.
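For illustration, here is a minimal sketch of that alternative representation, using the same old-style newff call as the question (the names below are made up for the example): the action is appended to the state as a one-hot vector and the net has a single scalar output.

%sketch: Q(s,a) as one net over a combined state-action input
%input ranges: position, velocity, then three 0/1 flags for the one-hot action
num_hidden=5;
qnet= newff([-1.2 0.6; -0.07 0.07; 0 1; 0 1; 0 1], [num_hidden,1], {'tansig', 'purelin'},'traingdm');

%evaluating Q(s,a) means building the combined input vector
s=[-0.5; 0.0];
a=3;
one_hot=zeros(3,1); one_hot(a)=1;
Qsa=qnet([s; one_hot]);   %scalar Q-value for this state-action pair

Note that greedy action selection then needs one forward pass per action instead of a single pass.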

Testing

This is a straightforward greedy exploitation of the Q-function. It should be correct.

Learning

There are several pitfalls here that you will have to think about. The first one is scaling: for neural network learning you really need to scale the inputs to a common range. If you use a sigmoidal activation function in the output layer, you may also have to scale the target values.
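As a sketch of the scaling point (the ranges are just the Mountain Car bounds already passed to newff in the question; the helper name is invented for the example), both state dimensions can be mapped to [-1, 1] before being fed to the net:

%map position ([-1.2, 0.6]) and velocity ([-0.07, 0.07]) to [-1, 1]
scale_state = @(s) [2*(s(1)+1.2)/1.8 - 1; 2*(s(2)+0.07)/0.14 - 1];

s_scaled=scale_state([-0.5; 0.0]);   %approx [-0.22; 0]
%use the scaled state wherever net(...) is called, for s and s' alike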

Data efficiency is another thing to think about. You can do multiple updates of the net with each transition sample. Learning will be faster, but you would have to store each transition sample in memory.
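One possible way to keep transitions around for reuse (the memory layout and variable names below are only an assumption for the sketch):

%replay memory: one column per transition [s; a; reward; s']
replay=zeros(6, maxEpisodes*maxSteps);
n_stored=0;

%inside the step loop, after SimulateMC:
n_stored=n_stored+1;
replay(:,n_stored)=[s; a; reward; s_prime];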

Online vs. batch: if you store your samples, you can do batch learning and avoid the problem that recent samples overwrite already-learned parts of the problem.
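Building on the stored transitions from the sketch above, a batch update could recompute the targets for all stored samples and then call train once on the whole set (again only a sketch under the same assumptions):

%recompute targets for every stored transition, then train in one batch
S=replay(1:2, 1:n_stored);      %states, one column each
T=net(S);                       %start from the net's current predictions
for k=1:n_stored
    a_k=replay(3,k);
    r_k=replay(4,k);
    sp_k=replay(5:6,k);
    T(a_k,k)=r_k+gamma*max(net(sp_k));   %new target only for the taken action
end
net=train(net,S,T);             %one batch training call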

Literature

You should have a look at
