Question

In the original BERT paper, section 'A.2 Pre-training Procedure', it is mentioned:

The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.

And in the RoBERTa paper, section '4.4 Text Encoding' it is mentioned:

The original BERT implementation (Devlin et al., 2019) uses a character-level BPE vocabulary of size 30K, which is learned after preprocessing the input with heuristic tokenization rules.

I would appreciate it if someone could clarify why the RoBERTa paper says that BERT uses BPE.


Solution

BPE and WordPiece are largely equivalent subword algorithms with only minor differences: BPE picks merges by raw pair frequency, while WordPiece picks the merge that most increases the likelihood of the training corpus; both end up producing a similar subword vocabulary. In practical terms, the most visible difference is the marker convention: the classic BPE implementation (subword-nmt) appends @@ to non-final pieces, while WordPiece prefixes word-internal pieces with ##.

Therefore, I understand that the authors of RoBERTa took the liberty of using the terms BPE and WordPiece interchangeably.
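If it helps to see the marker convention concretely, here is a minimal sketch (assuming the Hugging Face transformers package and network access to download the pretrained bert-base-uncased vocabulary; the exact token splits shown in the comments are illustrative, not guaranteed):

```python
# Minimal sketch: inspect BERT's WordPiece segmentation and its "##" continuation markers.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece marks word-internal pieces with a leading "##".
print(tokenizer.tokenize("tokenization"))
# likely output: ['token', '##ization']

# For comparison, a subword-nmt-style BPE segmentation would typically be
# written as 'token@@ ization' -- same kind of subword split, different marker.
```

The segmentation itself looks much the same in either scheme, which is presumably why the RoBERTa authors did not distinguish between the two.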
