Question

I have an environment with 4 objects in it. All of these objects can either be selected or not selected, so the actions taken by my DQN should look like [1,0,1,1], [0,0,0,1], [1,1,0,0], etc.

Here, 1 denotes that the object was selected and 0 denotes that it was not. The environment state given as input to the DQN consists of attributes for each of the objects and other factors of the environment. The DQN gets rewards based on the selection it made. I'm new to reinforcement learning and I've only built DQNs that had to select one action out of the entire action space. How do I build a DQN, or a reinforcement learning network, for this particular environment?


Solution

The DQN agent does not need to care what the actions represent; in your case it only needs to make a discrete choice, and the action space is easy to enumerate. Ignoring the meaning of the selections for a moment, you have $2^4 = 16$ discrete actions. The simplest way to model this is to give the agent a single discrete action space of 16 indexed actions, which you then map to the selections you need in order to assess the results. As long as you do this consistently (e.g. take the binary representation of the action index), this is fine.
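As a concrete illustration, here is a minimal sketch of one such consistent mapping in NumPy (the bit order is an arbitrary choice, it only has to be the same in both directions):

```python
import numpy as np

N_OBJECTS = 4
N_ACTIONS = 2 ** N_OBJECTS  # 16 discrete actions for the DQN

def index_to_selection(action_index, n_objects=N_OBJECTS):
    """Map a discrete action index (0..15) to a binary selection vector, e.g. 13 -> [1, 0, 1, 1]."""
    return np.array([(action_index >> i) & 1 for i in range(n_objects)])

def selection_to_index(selection):
    """Inverse mapping: binary selection vector -> discrete action index."""
    return int(sum(int(bit) << i for i, bit in enumerate(selection)))

# Example round trip: the agent outputs index 13, the environment sees [1, 0, 1, 1]
a = 13
sel = index_to_selection(a)
assert selection_to_index(sel) == a
```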

It is also possible that a more direct representation would help, depending on the true nature of the value function. In that case you could model the neural network for $\hat{q}(s,a,\theta)$ with the action vector concatenated to the state vector at the input, and a single output giving the estimated value of that specific state and action combination. To choose an action, you would build a minibatch of 16 inputs that all share the same state component and cover the 16 possible action vectors, then pick the row with the highest estimate and read the action part of its input vector to see which selection was estimated to be best.
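A minimal PyTorch sketch of this state–action input arrangement is below. The network architecture, `STATE_DIM`, and the helper name `best_selection` are placeholders for illustration, not something prescribed by the problem:

```python
import itertools
import numpy as np
import torch
import torch.nn as nn

N_OBJECTS = 4
STATE_DIM = 10  # assumed size of the state vector; replace with your own

# Q-network: input is the state concatenated with the 4-bit selection,
# output is a single scalar estimate of q(s, a).
q_net = nn.Sequential(
    nn.Linear(STATE_DIM + N_OBJECTS, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def best_selection(state):
    """Score all 2^4 selections for one state and return the highest-valued one."""
    actions = np.array(list(itertools.product([0, 1], repeat=N_OBJECTS)), dtype=np.float32)  # (16, 4)
    states = np.tile(state.astype(np.float32), (len(actions), 1))                            # (16, STATE_DIM)
    batch = torch.from_numpy(np.concatenate([states, actions], axis=1))                      # (16, STATE_DIM + 4)
    with torch.no_grad():
        q_values = q_net(batch).squeeze(1)        # (16,) estimated action values
    return actions[q_values.argmax().item()]      # selection vector of the greedy action

greedy_selection = best_selection(np.random.rand(STATE_DIM))
```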

If you are not sure which approach would suit the problem best, you could try both.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange