Question

I am implementing Softmax Action Selection policy for a reinforcement learning task (http://www.incompleteideas.net/book/ebook/node17.html).

I came with this solution, but I think there is room for improvement.

1-Here I evaluate the probabilities

    prob_t = [0]*3
    denominator = 0
    for a in range(nActions):
        denominator += exp(Q[state][a] / temperature) 

    for a in range(nActions):
        prob_t[a] = (exp(Q[state][a]/temperature))/denominator  

2-Here I am comparing a random generated number in the range ]0,1[ to the probabilities value of the actions:

    rand_action = random.random()
    if rand_action < prob_t[0]:
        action = 0      
    elif rand_action >= prob_t[0] and rand_action < prob_t[1]+prob_t[0]:
        action = 1      
    else: #if rand_action >= prob_t[1]+prob_t[0]
        action = 2

edit:

example: rand_action is 0.78, prob_t[0] is 0.25, prob_t[1] is 0.35, prob_t[2] is 0.4. the probabilities sum to 1. 0.78 is greater than the sum of the probabilities for action 0 and 1 (prob_t[0] + prob_t[1]) therefore action 2 is picked.

Is there a more efficient way of doing this?

Was it helpful?

Solution 3

After the suggestions to use numpy I did a bit of research and came with this solution for the first part of the soft-max implementation.

prob_t = [0,0,0]       #initialise
for a in range(nActions):
    prob_t[a] = np.exp(Q[state][a]/temperature)  #calculate numerators

#numpy matrix element-wise division for denominator (sum of numerators)
prob_t = np.true_divide(prob_t,sum(prob_t))      

There's a for loop less than my initial solution. The only downside I can appreciate is a reduced precision.

using numpy:

[ 0.02645082  0.02645082  0.94709836]

initial two-loops solution:

[0.02645082063629476, 0.02645082063629476, 0.9470983587274104]

OTHER TIPS

Action selection based on probabilities can be easily done using numpy library.

q_values = [] #array of q_values
action = np.random.choice(q_values,p=q_values)

After you evaluate the probabilities for each action, if you have a function to return you weighted random selection, you can get your desired action like this:

action = weighted_choice(prob_t)

Though I am not sure whether this is what you call "better way".

The weighted_choice can be something like this:

import random
def weighted_choice(weights):
    totals = []
    running_total = 0

    for w in weights:
        running_total += w
        totals.append(running_total)

    rnd = random.random() * running_total
    for i, total in enumerate(totals):
        if rnd < total:
            return i

Definitely check the binary search implementation in the article instead of the linear search one above if you have a lot of available actions.

Or if you have access to numpy:

import numpy as np
def weighted_choice(weights):
    totals = np.cumsum(weights)
    norm = totals[-1]
    throw = np.random.rand()*norm
    return np.searchsorted(totals, throw)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top