Question

In the original BERT paper, section 'A.2 Pre-training Procedure', it is mentioned:

The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.

And in the RoBERTa paper, section '4.4 Text Encoding' it is mentioned:

The original BERT implementation (Devlin et al., 2019) uses a character-level BPE vocabulary of size 30K, which is learned after preprocessing the input with heuristic tokenization rules.

I would appreciate it if someone could clarify why the RoBERTa paper says that BERT uses BPE.


Solution

BPE and WordPiece are closely related subword tokenization algorithms: both build a vocabulary by iteratively merging symbol pairs, differing mainly in the merge criterion (pair frequency for BPE versus a likelihood-based score for WordPiece). In practical terms, the most visible difference is the marker convention: the common BPE implementation appends @@ to non-final pieces of a split word, while WordPiece prepends ## to non-initial pieces.
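To make the marker convention concrete, here is a minimal sketch (not part of the original answer) using the Hugging Face `transformers` library; the model names and sample sentence are illustrative assumptions, and the exact splits may differ from the comments:

```python
# Contrast BERT's WordPiece tokenizer with RoBERTa's byte-level BPE tokenizer.
# Assumes the `transformers` package and the public checkpoints below.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece vocabulary
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")    # byte-level BPE vocabulary

text = "Tokenization discrepancies"

# WordPiece prefixes non-initial pieces of a word with "##",
# e.g. something like ['token', '##ization', 'disc', '##re', '##pan', '##cies'].
print(bert_tok.tokenize(text))

# RoBERTa's byte-level BPE instead marks word-initial pieces with "Ġ"
# (an encoded leading space). The "@@" suffix convention mentioned above
# comes from the subword-nmt BPE implementation, not from RoBERTa itself.
print(roberta_tok.tokenize(text))
```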

Therefore, my understanding is that the RoBERTa authors simply use "BPE" and "WordPiece" somewhat interchangeably when describing BERT's vocabulary.

Licensed under: CC-BY-SA with attribution