Question

I am reading this article on how to use BERT by Jay Alammar and I understand things up until:

For sentence classification, we're only interested in BERT's output for the [CLS] token, so we select that slice of the cube and discard everything else.

I have read this topic, but still have some questions:

Isn't the [CLS] token at the very beginning of each sentence? Why is it that "we are only interested in BERT's output for the [CLS] token"? Can anyone help me get my head around this? Thanks!


Solution

[CLS] stands for classification, and it's there to represent the whole sentence for sentence-level classification.

In short, this token was introduced to make BERT's pooling scheme work. I suggest reading this blog, where this is also covered in detail.
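
For concreteness, here is a minimal sketch of selecting that [CLS] slice, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is named in the answer):

    # Minimal sketch: select the [CLS] output with Hugging Face transformers.
    # The library choice and checkpoint are assumptions, not from the answer.
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # The tokenizer prepends [CLS] automatically, so it sits at position 0.
    inputs = tokenizer("A single sentence to classify.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, hidden_size).
    # Slicing position 0 keeps only the [CLS] vector and discards the rest.
    cls_vector = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)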

OTHER TIPS

[CLS] stands for classification. It is added at the beginning because the training task here is sentence classification. And because they need an input that can represent the meaning of the entire sentence, they introduce a new token.

They can't take any other word from the input sequence, because that word's output is a representation of the word itself. So they add a token whose only purpose is to serve as a sentence-level representation for classification.
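
To make the "sentence-level representation" point concrete: a classification head consumes only the [CLS] vector and ignores every per-token output. A hypothetical sketch, with the hidden size of bert-base and a two-label task assumed:

    # Hypothetical classifier head on top of the [CLS] vector alone.
    # hidden_size matches bert-base; the label count is an assumption.
    import torch
    import torch.nn as nn

    hidden_size, num_labels = 768, 2
    cls_vector = torch.randn(1, hidden_size)  # stand-in for the [CLS] output above
    classifier = nn.Linear(hidden_size, num_labels)
    logits = classifier(cls_vector)           # sentence-level class scores, shape (1, 2)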

Licensed under: CC-BY-SA with attribution