Q-learning when minimising a total cost instead of maximising a total reward

https://datascience.stackexchange.com/questions/56621

02-11-2019

Question

I have a decision problem where the results are measured as a cost that I want to minimise. It seems like a good fit for Q-learning, but I am not sure how to adjust it to deal with a cost instead of a reward.
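Stated formally, as a sketch in standard notation that the question itself does not use ($c$ for the immediate cost, $\gamma$ for the discount factor, $\alpha$ for the learning rate), minimising the total discounted cost is the same problem as maximising the total discounted reward with $r = -c$. The tabular update can be written in either form, and because $\max_{a'}(-Q_c) = -\min_{a'} Q_c$, starting from $Q_r = -Q_c$ keeps $Q_r = -Q_c$ after every update, so $\arg\max_a Q_r = \arg\min_a Q_c$:

```latex
\begin{aligned}
Q_c(s,a) &\leftarrow Q_c(s,a) + \alpha\Big[c + \gamma \min_{a'} Q_c(s',a') - Q_c(s,a)\Big] \\
Q_r(s,a) &\leftarrow Q_r(s,a) + \alpha\Big[r + \gamma \max_{a'} Q_r(s',a') - Q_r(s,a)\Big], \qquad r = -c
\end{aligned}
```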

Which of these two approaches is better?

Initializing the Q-values for all actions with zeros, getting the agent to learn the actions that maximize the Q-values, and later filtering out the actions with minimum Q-values. The Q-value update would then be:

q_dict['state1']['act1'] += (
    r + max(q_dict['state2'].values()))
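A minimal runnable sketch of the first approach, written as the standard tabular Q-learning update rather than the bare `+=` above. It assumes the reward is the negated cost (`reward = -cost`), a learning rate `ALPHA`, a discount factor `GAMMA`, and a hypothetical toy problem (`TRANSITIONS`); none of these come from the question. Because every reward is negative, the greedy policy picks the action with the largest Q-value, which is the action with the smallest expected cost.

```python
from collections import defaultdict

# Assumed hyperparameters (not from the question).
ALPHA = 0.5  # learning rate
GAMMA = 0.9  # discount factor
ACTIONS = ["left", "right"]

# Hypothetical toy problem: (state, action) -> (cost, next_state); None ends the episode.
TRANSITIONS = {
    ("start", "left"): (1.0, "mid"),
    ("start", "right"): (4.0, None),
    ("mid", "left"): (10.0, None),
    ("mid", "right"): (5.0, None),
}


def q_update_max(q, state, action, cost, next_state):
    """One tabular Q-learning step that maximises reward, using reward = -cost."""
    reward = -cost
    # A terminal transition has no future reward to bootstrap from.
    best_next = 0.0 if next_state is None else max(q[next_state][a] for a in ACTIONS)
    target = reward + GAMMA * best_next
    q[state][action] += ALPHA * (target - q[state][action])


def greedy_max(q, state):
    # Rewards are negated costs, so the cheapest action has the LARGEST Q-value.
    return max(ACTIONS, key=lambda a: q[state][a])


# Option 1: every Q-value starts at zero.
q = defaultdict(lambda: defaultdict(float))
for _ in range(200):  # repeated sweeps over every transition
    for (s, a), (c, s_next) in TRANSITIONS.items():
        q_update_max(q, s, a, c, s_next)

print(dict(q["start"]), greedy_max(q, "start"))
```

After the sweeps, `q["start"]["left"]` is close to -5.5 (the negation of a cost of 1 now plus 0.9 times the cost 5 of the cheapest follow-up) and `q["start"]["right"]` is close to -4, so the greedy action in `"start"` is `"right"` even though `"left"` has the lower immediate cost.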

Initializing the Q-values with a big number, getting the agent to learn the actions that minimize the Q-values, and later filtering out the actions with minimum Q-values. The Q-value update would then be:

q_dict['state1']['act1'] -= (
    r + max(q_dict['state2'].values()))
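The second approach can be sketched the same way, keeping the cost as it is and replacing `max` with `min` both in the bootstrap term and when choosing an action. Note that the snippet above still uses `max` in the bootstrap term, which would make the agent learn from the most expensive next action. The hyperparameters, the toy `TRANSITIONS` table and the starting value `BIG` are assumptions for illustration only.

```python
from collections import defaultdict

# Assumed hyperparameters (not from the question).
ALPHA = 0.5  # learning rate
GAMMA = 0.9  # discount factor
BIG = 100.0  # large initial Q-value, as in option 2
ACTIONS = ["left", "right"]

# Hypothetical toy problem: (state, action) -> (cost, next_state); None ends the episode.
TRANSITIONS = {
    ("start", "left"): (1.0, "mid"),
    ("start", "right"): (4.0, None),
    ("mid", "left"): (10.0, None),
    ("mid", "right"): (5.0, None),
}


def q_update_min(q, state, action, cost, next_state):
    """One tabular Q-learning step that minimises expected discounted cost."""
    # Bootstrap from the CHEAPEST next action; a terminal transition costs nothing more.
    best_next = 0.0 if next_state is None else min(q[next_state][a] for a in ACTIONS)
    target = cost + GAMMA * best_next
    q[state][action] += ALPHA * (target - q[state][action])


def greedy_min(q, state):
    # Q-values are expected costs, so the best action has the SMALLEST Q-value.
    return min(ACTIONS, key=lambda a: q[state][a])


# Option 2: every Q-value starts at a big number.
q = defaultdict(lambda: defaultdict(lambda: BIG))
for _ in range(200):  # repeated sweeps over every transition
    for (s, a), (c, s_next) in TRANSITIONS.items():
        q_update_min(q, s, a, c, s_next)

print(dict(q["start"]), greedy_min(q, "start"))
```

Here `q["start"]["left"]` settles near 5.5 and `q["start"]["right"]` near 4, the negatives of what maximising `reward = -cost` gives for the same costs. In this deterministic sweep the large starting value only affects how quickly the values settle, not where they end up.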

No correct solution

Licensed under: CC-BY-SA with attribution
