Question

I'm tuning a deep learning model for a Space Invaders learner (image below). The state is defined as the relative Euclidean distance between the player and each enemy, plus the relative distance between the player and the 6 closest enemy lasers, both normalized by the window height: if the player's position is $(x_p,y_p)$ and an enemy's position is $(x_e,y_e)$, the relative Euclidean distance is $\frac{\sqrt{(x_p-x_e)^2+(y_p-y_e)^2}}{HEIGHT}$, where HEIGHT is the window height. The observation space therefore has dimension $10+6$, so the input layer of my deep neural network has 16 units.
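For reference, a minimal sketch of how such a 16-dimensional observation could be assembled (the names `player`, `enemies`, `lasers` and the padding value are illustrative assumptions, not the actual code):

```python
import math

HEIGHT = 600  # window height used for normalisation (example value)

def rel_distance(p, q):
    """Euclidean distance between two (x, y) points, normalised by HEIGHT."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) / HEIGHT

def build_state(player, enemies, lasers):
    """10 enemy distances + 6 closest laser distances -> 16-dim state."""
    enemy_feats = [rel_distance(player, e) for e in enemies]           # 10 values
    laser_feats = sorted(rel_distance(player, l) for l in lasers)[:6]  # 6 closest
    laser_feats += [1.0] * (6 - len(laser_feats))  # pad when fewer than 6 lasers exist
    return enemy_feats + laser_feats

state = build_state((300, 550),
                    [(50 * i, 100) for i in range(10)],
                    [(310, 400), (100, 200)])
assert len(state) == 16
```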

My agent doesn't seem to learn (the reward doesn't increase), so I checked the mean Q values, which are the output of my main deep neural network. Instead of increasing, the mean Q values plateau (as in the figure below). I've modified many tuning parameters (batch size, neural network architecture and hyperparameters, ...) but I still have the same problem. Any idea why the mean Q values wouldn't increase?

[Figure: mean Q values over training, flattening out instead of increasing]

Here are some results about the learner: [figure: learner training results]


Solution

I think your main problem is the use of relative distance as the core feature. It has two major weaknesses:

  • The distance to an object does not give the direction to the object, yet the best action choices all depend critically on direction. For example, an enemy laser bolt 0.1 units directly above the player is an immediate danger requiring evasive action, whilst one 0.1 units to the left or right is not a danger and is about to leave the game window. Your relative-distance feature does not distinguish between those scenarios, but it is a critical difference.

  • Slightly less important: the raw distance does not capture any sense of movement. If enemies move consistently from turn to turn, but not always in exactly the same direction or at the same speed, then their velocities should also be part of the state.
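To make the first point concrete, a quick check (coordinates are made up, with y growing downward as in most game frameworks) that two very different threat situations collapse to the identical distance feature:

```python
import math

HEIGHT = 600
player = (300, 550)
laser_above = (300, 490)   # 60 px directly above: imminent threat
laser_beside = (240, 550)  # 60 px to the left: harmless, drifting away

def rel_distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1]) / HEIGHT

# Both lasers produce exactly the same state feature (0.1), so the
# agent cannot distinguish "dodge now" from "safe to ignore".
assert rel_distance(player, laser_above) == rel_distance(player, laser_beside)
```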

One way you could improve the features is to add a velocity component for each item, showing how quickly it is approaching or receding from the player. This might help a little, but my feeling is that you need more data than distance and speed.

I think you should use normalised $x, y$ position as features for each item being tracked, plus normalised velocity $dx, dy$ for any object type that can change direction (if enemy lasers are always falling straight down you might not need anything for those).
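As a sketch of what that richer state could look like, using one frame of history to estimate per-frame velocities (WIDTH, HEIGHT and the example coordinates are assumptions about the setup):

```python
WIDTH, HEIGHT = 800, 600  # example window size

def object_features(obj, prev_obj=None):
    """Normalised (x, y) position, plus normalised per-frame velocity
    (dx, dy) when a previous position is available."""
    x, y = obj
    feats = [x / WIDTH, y / HEIGHT]
    if prev_obj is not None:
        px, py = prev_obj
        feats += [(x - px) / WIDTH, (y - py) / HEIGHT]
    return feats

# Enemies change direction, so include dx, dy; if lasers always fall
# straight down, position alone may be enough for them.
enemy_now, enemy_prev = (420, 120), (415, 120)
laser = (300, 480)
state = object_features(enemy_now, enemy_prev) + object_features(laser)
```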

In addition:

  • If the window edges are important, you should include at least the relative $x$ of one of them, so the agent knows its absolute position on screen and how much space it has to maneuver. This is true whether the player is blocked from moving further left or right, or whether the player "wraps around" to the other side of the screen. Both types of effect will significantly affect how the game plays near the screen edge.

  • In order to track the predicted value, you also need to track the locations of the player's missiles. It is not enough to just let the agent predict when it is best to fire: to learn an accurate value function, it needs to "see" whether the missile it fired some time steps ago is likely to hit or miss a target.

  • For both enemy lasers and player missiles, it is OK to filter and sort the data by some criteria (such as distance to player). As long as this is consistent it may even help a lot to have such pre-processing.
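A consistent pre-processing step along those lines might look like this sketch, which keeps the k nearest objects in sorted order and pads with a sentinel so the state dimension never changes (the value of k and the sentinel coordinates are tuning choices, not prescriptions):

```python
import math

def k_nearest(player, objects, k, pad=(-1.0, -1.0)):
    """Keep the k objects closest to the player, nearest first, padded
    with a sentinel chosen outside the normal coordinate range so the
    network input always has a fixed size."""
    objs = sorted(objects,
                  key=lambda o: math.hypot(o[0] - player[0], o[1] - player[1]))
    objs = objs[:k]
    objs += [pad] * (k - len(objs))
    return objs

player = (300, 550)
lasers = [(310, 400), (100, 200), (305, 500), (700, 100)]
nearest = k_nearest(player, lasers, k=6)  # always exactly 6 entries
```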

Licensed under: CC-BY-SA with attribution