Question

I have to solve this problem with Q-learning. Well, actually I have to evaluate a Q-learning-based policy on it.

I am a tourist manager.

I have n hotels, and each can hold a different number of people.

For each person I put in a hotel I get a reward, based on which room I have chosen.

If I want, I can also murder the person, so they go into no hotel but give me a different reward. (OK, that's a joke... but it means I can have a self-transition, so the number of people in my rooms doesn't change after that action.)

  • my state is a vector containing the number of people in each hotel.

  • my action is a vector of zeroes and ones that tells me where I put
    the new person.

  • my reward matrix is formed by the rewards I get for each transition
    between states (including the self-transition); see the sketch below
    this list for one possible encoding.
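
For concreteness, here is a minimal sketch of the encoding I have in mind (the number of hotels and the occupancies are made up for illustration):

    import numpy as np

    # 3 hotels; the state is the current occupancy of each one.
    state = np.array([3, 0, 5])

    # The action is a one-hot vector: put the new person in hotel 2 ...
    action = np.array([0, 1, 0])

    # ... or an all-zero vector: nobody is placed (the self-transition).
    no_hotel = np.array([0, 0, 0])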

Now, since I can get an unlimited number of people (i.e. I can fill the hotels but keep on killing newcomers), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...

What am I seeing wrongly? Should I choose a random state as the final one? Have I missed the point entirely?


Solution 2

RL problems don't need a terminal state per se. What they need is a reward signal. So, as long as you have some rewards, you are good to go, I think.

I don't have a lot of experience with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with a discrete approach, you would get a good start and learn something about your problem by limiting its scope (a finite number of people and hotels/rooms) and turning Q-learning loose on the smaller state space.
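
For example, here is a minimal sketch of the capped, discrete version (the hotel capacities are invented purely to show the idea, and the extra "turn the person away" action stands in for the self-transition described in the question):

    from itertools import product

    import numpy as np

    capacities = [2, 3]                      # assume 2 small hotels for illustration
    n_hotels = len(capacities)

    # Enumerate every possible occupancy vector: (0, 0), (0, 1), ..., (2, 3).
    states = list(product(*(range(c + 1) for c in capacities)))
    state_index = {s: i for i, s in enumerate(states)}

    # One action per hotel, plus one "turn the person away" (self-transition) action.
    n_actions = n_hotels + 1

    # Once the problem is capped, the Q "matrix" is simply |S| x |A| and finite.
    Q = np.zeros((len(states), n_actions))

    print(len(states), "states,", n_actions, "actions")   # 12 states, 3 actions

With the state space enumerated like this, ordinary tabular Q-learning applies directly.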

Or, you could jump right into a method that can handle an infinite state space, like a neural network.

In my experience, if you have the patience to try the smaller problem first, you will be better prepared to solve the bigger one next.

OTHER TIPS

This question is old, but I think it merits an answer.

One of the issues is that there is not necessarily a notion of an episode, with a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma less than one that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
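
For reference, the return mentioned above is the standard discounted sum (this is textbook notation, not something specific to the hotel problem):

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
        = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma < 1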

To learn the optimal Q, which is the expected return for following the optimal policy, you need a way to perform off-policy Q-learning updates. If you are using sample transitions for those updates, then you will have to specify a behavior policy that takes actions in the environment to collect the samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction" by Sutton and Barto.
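
As a hedged illustration of those two points (off-policy updates plus an explicit behavior policy), here is a rough tabular sketch. The env object with a gym-like reset()/step() interface, the assumption that states are integer indices, and all hyperparameters are placeholders, not something given in the question:

    import numpy as np

    def q_learning(env, Q, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=200):
        n_actions = Q.shape[1]
        rng = np.random.default_rng(0)
        for _ in range(episodes):
            s = env.reset()                     # env is assumed to return integer state indices
            for _ in range(max_steps):          # truncate rollouts: the task itself never ends
                # Behavior policy: epsilon-greedy over the current Q estimates.
                if rng.random() < epsilon:
                    a = int(rng.integers(n_actions))
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)   # assumed gym-like interface
                # Off-policy target: greedy (max) value in the next state.
                Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
                s = s_next
                if done:
                    break
        return Q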

Maybe it isn't an answer to "is it possible?", but... read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected reward per time step (the average reward). Joint learning of Q and rho results in a better strategy.
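
For reference, here is a rough tabular sketch of one R-learning update in the average-reward setting; the step sizes are arbitrary and the tabular state/action indexing is an assumption on my part:

    import numpy as np

    def r_learning_step(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
        """One tabular R-learning update (average-reward setting, no discounting)."""
        Q[s, a] += alpha * (r - rho + np.max(Q[s_next]) - Q[s, a])
        # Update the average-reward estimate rho only when a greedy action was taken.
        if Q[s, a] == np.max(Q[s]):
            rho += beta * (r - rho + np.max(Q[s_next]) - np.max(Q[s]))
        return Q, rho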

To build on the responses above: with an infinite state space, you should definitely consider some form of generalization for your Q function, since a tabular Q matrix cannot cover the space. Generalization lets the Q function produce useful values even for states it has never visited. You could experiment with several different function approximators, whether that is simple linear regression or a neural network.
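
As a hedged sketch of the linear option (the features below, raw occupancy counts plus a bias term, are just one possible choice and not something from the original question):

    import numpy as np

    def phi(state):
        """Feature vector: raw occupancy counts plus a bias term."""
        return np.append(np.asarray(state, dtype=float), 1.0)

    def q_values(w, state):
        return w @ phi(state)                     # one estimated value per action

    def q_update(w, state, action, reward, next_state, alpha=0.01, gamma=0.95):
        """Semi-gradient Q-learning update on the weights of the chosen action."""
        target = reward + gamma * np.max(q_values(w, next_state))
        td_error = target - q_values(w, state)[action]
        w[action] += alpha * td_error * phi(state)
        return w

    n_hotels, n_actions = 2, 3                    # illustrative sizes only
    w = np.zeros((n_actions, n_hotels + 1))       # one weight row per action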

As Martha said, you will need a gamma less than one to account for the infinite horizon. Otherwise, you would be trying to compare policies whose returns are all infinite, which means you would not be able to identify the optimal one.

The main thing I want to add here, for anyone reading this later, is the significance of reward shaping. In an infinite problem, where there is no final large reward, sub-optimal reward loops can occur: the agent gets "stuck" because a certain state yields a higher reward than any of its neighbours within the effective horizon (which is defined by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times, to avoid these sub-optimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.
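
One hedged way to implement such a penalty is to wrap the environment's reward; the window length and penalty size below are arbitrary illustrative values:

    from collections import deque

    class RevisitPenalty:
        def __init__(self, penalty=1.0, window=10):
            self.penalty = penalty
            self.recent = deque(maxlen=window)    # short memory of recently visited states

        def shape(self, state, reward):
            """Subtract a penalty if `state` (assumed to be a hashable tuple) was seen recently."""
            shaped = reward - self.penalty if state in self.recent else reward
            self.recent.append(state)
            return shaped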

Licensed under: CC-BY-SA with attribution