Question

I've been working through the Q-Network learning example in Arthur Juliani's blog. It's based on the fairly trivial OpenAI Gym FrozenLake environment. The base implementation gets about a 47% success rate over 3000 iterations. I decided to add a bias term to the implementation, and found that it severely harmed the results: performance dropped to no better than random.

That is, I added the bias term below:

# One-hot encoded state: a 1x16 vector selecting one of FrozenLake's 16 states
inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)
# The new bias term: one value per action, shared across all states
bias = tf.Variable(tf.zeros(shape=[1,4]))
W = tf.Variable(tf.random_uniform([16,4],0,0.01))
# Q values for the 4 actions; originally just Qout = tf.matmul(inputs1,W)
Qout = tf.matmul(inputs1,W) + bias
predict = tf.argmax(Qout,1)

The rest of the code is identical to the original solution. Any ideas why this would so negatively affect performance?
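
For context, the rest of the graph in the tutorial is the usual squared TD-error setup trained with plain SGD. Roughly (continuing the snippet above; the names nextQ/loss/updateModel are my paraphrase of the tutorial, not guaranteed verbatim):

# Target Q values, built outside the graph from the Bellman backup
# r + gamma * max_a' Q(s', a')
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
# Squared TD error between the target and the network's current estimate
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

Note that because bias is a tf.Variable, minimize(loss) now updates it on every training step alongside W.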

Update: It looks like someone else ran into this issue, and the answer given was that

Having a bias term with the one-hot encoding prevents each state’s Q values from being independent

Any ideas why this is the case? The bias is added after the multiplication, so it's in the dimension of the actions, not the inputs. I don't see why this would make learning fail.
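
To make the quoted claim concrete, here is a small numpy sketch (my own illustration, not from the tutorial) of the coupling. With a one-hot input, the Q row for state s is W[s] + b, and b appears in every state's Q values, so any gradient step that moves b shifts all 16 states at once; without the bias, only the row W[s] is touched:

import numpy as np

n_states, n_actions = 16, 4
W = np.random.uniform(0, 0.01, size=(n_states, n_actions))
b = np.zeros((1, n_actions))

def q_values(state):
    one_hot = np.eye(n_states)[state:state + 1]   # shape (1, 16)
    return one_hot @ W + b                        # = W[state] + b

# A TD update for state 0 also has a nonzero gradient w.r.t. b,
# because dQout/db = 1 no matter which state produced the error.
before = q_values(7).copy()
td_error = np.array([[0.5, 0.0, 0.0, 0.0]])       # hypothetical error at state 0
lr = 0.1
W[0] += lr * td_error[0]                          # only touches state 0's row
b += lr * td_error                                # touches every state's Q values

print(q_values(7) - before)                       # nonzero: state 7's Q values moved
                                                  # even though we updated state 0

So even though the bias lives in the action dimension, it is shared across states: every TD update leaks into all 16 states' Q values, whereas without it each state's row of W is learned independently. That appears to be what the quoted answer means, though I'd welcome confirmation.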

No correct solution
