Question

I have a question that I could not find an answer to:

While training a reinforcement learning agent (using DQN), I get a model that estimates the best reward for the next action. Now, if I deploy this model (i.e., use this model to make predictions), does it keep learning (i.e., updating the Q values)?


Solution

Now, if I deploy this model (i.e., use this model to make predictions), does it keep learning (i.e., updating the Q values)?

If you want it to (and understand how to code it), then yes, a reinforcement learning agent - including a DQN-based one - can do this. This is online learning, and it is also possible with many supervised learning techniques.

Because there is a risk of the agent learning incorrectly, you may choose to limit online learning, or disable it entirely in production. It could also happen by mistake if you are not sure how to stop it and simply deploy your existing agent code from the training scripts into production. So make sure that you, or whoever implements the production version of the agent, understands how to control whether learning is still occurring - see the sketch below.
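For illustration, here is a minimal sketch of what that control could look like, assuming a DQN implemented in PyTorch. Everything in it - the `DeployedDQNAgent` class, the `learn_online` flag, the omission of a target network - is hypothetical and chosen for brevity, not code from any particular library:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn


class DeployedDQNAgent:
    """Sketch of a deployed DQN agent.

    `learn_online` controls whether Q-value updates continue in
    production; when it is False, the agent only acts greedily with
    its already-trained network and ignores new transitions.
    """

    def __init__(self, q_network: nn.Module, n_actions: int,
                 learn_online: bool = False, epsilon: float = 0.0,
                 gamma: float = 0.99, lr: float = 1e-4):
        self.q_network = q_network
        self.n_actions = n_actions
        self.learn_online = learn_online
        self.epsilon = epsilon
        self.gamma = gamma
        self.optimizer = torch.optim.Adam(q_network.parameters(), lr=lr)
        self.replay = deque(maxlen=10_000)

    def act(self, state: np.ndarray) -> int:
        # epsilon-greedy behaviour policy; epsilon = 0 means pure exploitation
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            q_values = self.q_network(torch.as_tensor(state, dtype=torch.float32))
        return int(q_values.argmax().item())

    def observe(self, state, action, reward, next_state, done):
        # With online learning disabled, live transitions are simply discarded.
        if not self.learn_online:
            return
        self.replay.append((state, action, reward, next_state, done))
        self._update()

    def _update(self, batch_size: int = 32):
        if len(self.replay) < batch_size:
            return
        batch = random.sample(self.replay, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))

        states_t = torch.as_tensor(states, dtype=torch.float32)
        next_states_t = torch.as_tensor(next_states, dtype=torch.float32)
        rewards_t = torch.as_tensor(rewards, dtype=torch.float32)
        dones_t = torch.as_tensor(dones, dtype=torch.float32)
        actions_t = torch.as_tensor(actions, dtype=torch.int64)

        # Standard one-step Q-learning target (no separate target network
        # here, purely to keep the sketch short).
        q_pred = self.q_network(states_t).gather(1, actions_t.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = self.q_network(next_states_t).max(dim=1).values
        target = rewards_t + self.gamma * (1.0 - dones_t) * q_next

        loss = nn.functional.mse_loss(q_pred, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
```

The important part is that the update step is gated behind a single, explicit flag, so whoever deploys the agent can see - and decide - whether Q values keep changing in production.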

You may choose to enable online learning if your initial training was in simulation and you would like the agent to learn more from real-world interactions. Or you might choose to do so if the problem is non-stationary, i.e. the environment changes over time. Many problems that involve interacting with a population of people change over time as demographics shift.

If the environment is stochastic, then in theory you could also switch off exploration - typically by setting $\epsilon = 0$ in an $\epsilon$-greedy behaviour policy. Because the environment's randomness still exposes the agent to varied transitions, this gives the agent a (limited) ability to refine its Q-value estimates with relatively low risk. The agent would continue to attempt to act optimally, but might learn enough to decide that different actions were optimal. Note this is still not without risk, because the learning process could fail in some way, leading the agent to predict a wildly incorrect optimal action.

Allowing Q-learning to explore non-optimal actions - typically by setting $\epsilon > 0$ in an $\epsilon$-greedy behaviour policy - is riskier in production, because the agent will occasionally pick a non-optimal action in order to refine its estimate of that action. It may result in improved learning in the longer term though, so you might decide to do that if the consequences of non-optimal actions are mild.
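To make that trade-off concrete, the two deployment choices could look like this with the hypothetical agent sketched above (the class and parameter names are assumptions, not a standard API):

```python
# Cautious deployment: keep learning from live transitions, but never
# deliberately explore (relies on environment stochasticity for variety).
cautious_agent = DeployedDQNAgent(q_network, n_actions=4,
                                  learn_online=True, epsilon=0.0)

# More aggressive deployment: occasionally take a random action to keep
# refining estimates of non-greedy actions, at the cost of sometimes
# acting sub-optimally.
exploring_agent = DeployedDQNAgent(q_network, n_actions=4,
                                   learn_online=True, epsilon=0.05)
```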
