Question

I have a decision problem where the results are measured as a cost that I want to minimise. It seems like a good fit for Q-learning, but I am not sure how to adapt the algorithm to work with a cost instead of a reward.

Which one is better:

  1. Initialize the Q-values for all actions to zero, have the agent learn the actions that maximize the Q-values, and later filter out the actions with minimum Q-values. The Q-value update would then be:
# Add r plus the largest Q-value available in the next state.
q_dict['state1']['act1'] += (
    r + max(q_dict['state2'][u] for u in q_dict['state2'])
)
  2. Initialize the Q-values to a large number, have the agent learn the actions that minimize the Q-values, and later filter out the actions with minimum Q-values. The Q-value update would then be:
# Subtract r plus the largest Q-value available in the next state.
q_dict['state1']['act1'] -= (
    r + max(q_dict['state2'][u] for u in q_dict['state2'])
)
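
For comparison, here is a minimal sketch of how the textbook tabular Q-learning update is usually written, once for a cost signal and once for a reward signal. It is not taken from either option above. The learning rate `alpha`, the discount factor `gamma`, and the function names `q_update_cost`, `q_update_reward` and `greedy_action` are all assumptions introduced for illustration; only the `q_dict`-style nested dictionary comes from the question.

```python
def q_update_cost(q, s, a, cost, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step where Q estimates expected discounted cost.

    alpha and gamma are assumed hyperparameters, not part of the question.
    """
    best_next = min(q[s_next].values())  # cheapest continuation from s_next
    q[s][a] += alpha * (cost + gamma * best_next - q[s][a])


def q_update_reward(q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """The usual reward form; passing reward = -cost mirrors q_update_cost."""
    best_next = max(q[s_next].values())  # most rewarding continuation
    q[s][a] += alpha * (reward + gamma * best_next - q[s][a])


def greedy_action(q, s, minimise=True):
    """Pick the action with the smallest (cost) or largest (reward) Q-value."""
    pick = min if minimise else max
    return pick(q[s], key=q[s].get)


# Hypothetical two-state table in the same shape as q_dict in the question.
q_cost = {
    'state1': {'act1': 0.0, 'act2': 0.0},
    'state2': {'act1': 2.0, 'act2': 5.0},
}
# The same table with every value negated, to be treated as rewards.
q_reward = {s: {a: -v for a, v in acts.items()} for s, acts in q_cost.items()}

q_update_cost(q_cost, 'state1', 'act1', cost=3.0, s_next='state2')
q_update_reward(q_reward, 'state1', 'act1', reward=-3.0, s_next='state2')
```

Under these assumptions the two tables stay exact negatives of each other after every update, so minimising over costs and maximising over negated costs select the same actions.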

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange