(RL Curiosity) - “Exploration by Random Network Distillation” - what's the benefit?

datascience.stackexchange https://datascience.stackexchange.com/questions/75048

11-12-2020

Question

Curiosity-driven learning motivates the agent to explore unseen states. It does so by rewarding the agent more when its expectation differs from the actual next state (orange on the diagram). This extra reward is called the "intrinsic reward" $r^i$, which is added to the usual reward (now called the "extrinsic reward" $r^e$). Here is a video explaining this architecture.
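In symbols, with $\hat{s}_{t+1}$ denoting the forward model's prediction of the next state:

$$ r_t = r^e_t + r^i_t, \qquad r^i_t \propto \lVert \hat{s}_{t+1} - s_{t+1} \rVert^2 $$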

But this causes issues in states that change unpredictably ("noisy TV screens", or tree leaves moving in the wind). The agent then gets stuck contemplating them forever, because the difference between the expected and observed next state stays high, resulting in a permanently high intrinsic reward.


As I understood it, the authors tackle the problem by introducing the inverse model (blue on the diagram). It tries to predict the action that was taken to get from the current state to the observed next state.

This teaches the learned feature space to only encode what is relevant to the transition from the current state to the next state, and to ignore distractors like noisy TV screens (sketched below).
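A minimal PyTorch-style sketch of that idea (layer sizes and names are just illustrative, not the paper's exact architecture): the encoder is shaped by the inverse loss, and the forward-model error in that feature space becomes the intrinsic reward.

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Illustrative sketch of the Intrinsic Curiosity Module idea."""
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        # Feature encoder phi: shaped by the inverse loss, so it keeps what is
        # needed to explain the agent's own action and can drop distractors
        # such as a noisy TV.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        # Inverse model: predicts a_t from (phi(s_t), phi(s_t+1)).
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))
        # Forward model: predicts phi(s_t+1) from phi(s_t) and a_t.
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ReLU(),
                                           nn.Linear(64, feat_dim))

    def forward(self, obs, next_obs, action_onehot):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        action_logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        # Detaching the features here is a common implementation choice, so
        # that only the inverse loss shapes the encoder.
        phi_next_pred = self.forward_model(
            torch.cat([phi.detach(), action_onehot], dim=-1))
        # Intrinsic reward: forward-model error in feature space.
        r_int = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)
        return r_int, action_logits
```

Training would add a cross-entropy loss between `action_logits` and the action actually taken (the inverse loss) to the forward-model error, and only `r_int` is added to the extrinsic reward.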


Question:

Why is there a paper called Exploration by Random Network Distillation that builds on top of the original paper? To me, it seems to solve this same issue.

Namely we predict the output of a fixed randomly initialized neural network on the current observation

Wasn't the problem already tackled by the Inverse Model (blue on the diagram) in the original paper?


When using Random Network Distillation, the agent tries to predict an embedding of the current state, not of the next state as I initially thought.

If the agent sees a random frame on a TV, it will initially receive high intrinsic rewards, because its predictor doesn't yet match the output of the fixed random target network. The embedding produced by the target network looks arbitrary but is deterministic for a given observation, so the predictor network can (and should) memorize it for the states it has visited, which is what we want here. This approximates a state-visitation count, so RND is an interesting form of curiosity. Is it more useful than vanilla curiosity?
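A minimal sketch of that setup, assuming a PyTorch-style implementation (network sizes and names are illustrative):

```python
import torch
import torch.nn as nn

def make_net(obs_dim, out_dim=32):
    # Same architecture for target and predictor, different random init.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, out_dim))

obs_dim = 8
target = make_net(obs_dim)       # fixed, randomly initialized, never trained
predictor = make_net(obs_dim)    # trained to match the target's output
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs):
    # Prediction error on the *current* observation. The target is a
    # deterministic function of the observation, so the error is fully
    # reducible by training -- unlike predicting a stochastic next frame.
    return (predictor(obs) - target(obs)).pow(2).sum(dim=-1)

def update_predictor(obs):
    # Training on visited states drives the bonus toward zero for familiar
    # observations, which is what approximates a visit count.
    loss = intrinsic_reward(obs).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

When the bonus is used as a reward it would be computed under `torch.no_grad()` and, in practice, normalized; observations are usually normalized too so that the random target's output is well-behaved.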


Solution

Both papers are attempting to incentivize exploration for reinforcement learning agents.

Curiosity-driven Exploration by Self-supervised Prediction does that by adding an inverse model, which increases the complexity of the overall model's loss function.
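For reference, the curiosity paper folds everything into one objective of roughly this form, where $L_I$ and $L_F$ are the inverse- and forward-model losses, $\beta$ and $\lambda$ are trade-off weights, and $\theta_P$, $\theta_I$, $\theta_F$ are the policy, inverse-model, and forward-model parameters:

$$ \min_{\theta_P, \theta_I, \theta_F} \Big[ -\lambda\, \mathbb{E}_{\pi(s_t;\theta_P)}\Big[\textstyle\sum_t r_t\Big] + (1-\beta) L_I + \beta L_F \Big] $$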

Exploration by Random Network Distillation does that with two separate networks: a fixed, randomly initialized target and a predictor trained to match it. The predictor's error effectively tracks how well explored a state is and serves as the exploration bonus.

There are pros and cons to both approaches. A single loss function/model is easier to train. However, two separate models that interact are more flexible.

Licensed under: CC-BY-SA with attribution