Question

I'm working through Attention Is All You Need, and I have a question about masking in the decoder. The paper states that masking is used to ensure the model doesn't attend to any tokens in the future (i.e., not yet predicted), so that it can be used autoregressively during inference.

I don't understand how masking is used during inference. When the encoder is given an unseen sample with no ground truth output or prediction, it seems to me that there is nothing to mask, since there aren't any output tokens beyond what the decoder has already produced. Is my understanding of masking correct?

Thanks!


Solution

The trick is that you do not need masking at inference time. The purpose of masking is to prevent the decoder from attending to positions that correspond to tokens "in the future", i.e., tokens that will not be available at inference time because they will not have been generated yet. This matters during training: with teacher forcing, the entire ground-truth output sequence is fed into the decoder at once, so without the mask each position could simply look ahead at the token it is supposed to predict.
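To make this concrete, here is a minimal PyTorch sketch of the look-ahead mask applied to a toy score matrix during training; the shapes and values are illustrative, not the paper's reference implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True marks "future" positions (column > row) that must not be attended to.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# During training the whole target sequence is fed at once (teacher forcing),
# so scores for future positions are set to -inf before the softmax.
scores = torch.randn(4, 4)                                  # toy attention scores (seq_len x seq_len)
scores = scores.masked_fill(causal_mask(4), float("-inf"))
weights = torch.softmax(scores, dim=-1)                     # row t attends only to positions <= t
print(weights)                                              # upper triangle is exactly zero
```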

At inference time, this is no longer a problem: the decoder only ever sees the tokens it has already generated, so there are no future tokens to mask out.
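As an illustration, here is a toy single inference step, again with made-up PyTorch values: the attention scores only cover the tokens generated so far, so there is no future column that a mask could hide.

```python
import torch

# At inference, the decoder input at step t holds only the t tokens generated
# so far; random vectors of model dimension 8 stand in for real decoder states.
generated = torch.randn(3, 8)           # 3 tokens produced so far
q = generated[-1:]                      # query for the newest position
scores = q @ generated.T                # shape (1, 3): only past/current keys exist
weights = torch.softmax(scores, dim=-1)
print(weights)                          # nothing to set to -inf: there is no future key
```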
