Question

I was a bit confused on the details of the Masked Language Model in BERT pretraining. Does the model only predict the masked tokens for the purposes of pretraining or does it predict it for all tokens?


Solution

In short, the model only predicts the masked tokens; the pretraining loss is computed only at those positions.

According to Section 3.1 of the BERT paper:

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.
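To make the "loss only on masked tokens" point concrete, here is a minimal PyTorch sketch (not the actual BERT training code): roughly 15% of positions are selected at random, their original ids become the labels, and every other position gets the label -100 so that cross-entropy ignores it. The vocabulary size, [MASK] token id, and the random stand-in logits are placeholders, and the paper's 80%/10%/10% replacement scheme for selected tokens is omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes and ids, for illustration only.
vocab_size, seq_len, batch_size, mask_token_id = 30522, 16, 2, 103

# Fake input: a batch of WordPiece token ids.
input_ids = torch.randint(low=1000, high=vocab_size, size=(batch_size, seq_len))

# Select roughly 15% of positions at random to mask.
mask = torch.rand(input_ids.shape) < 0.15
mask[:, 0] = True  # ensure at least one masked position in this toy example

# Labels: the original token at masked positions, -100 everywhere else.
# -100 is the ignore_index, so unmasked positions contribute nothing to the loss.
labels = input_ids.clone()
labels[~mask] = -100

# Replace the selected positions in the input with the [MASK] token id.
masked_input = input_ids.clone()
masked_input[mask] = mask_token_id

# Stand-in for BERT: any model mapping token ids to per-position vocab logits.
# Random logits keep the sketch self-contained and runnable.
logits = torch.randn(batch_size, seq_len, vocab_size)

# Cross-entropy over the vocabulary, computed only at the masked positions.
loss = F.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)
print(loss)
```

The key detail is the -100 labels: hidden vectors are produced for every token, but only the masked positions are fed through the output softmax into the loss, matching the quoted passage.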

Licensed under: CC-BY-SA with attribution