Question

As far as I understand, BERT is a word embedding that can be fine-tuned or used directly.

With older word embeddings (word2vec, GloVe), each word was only represented once in the embedding (one vector per word). This was a problem because it did not take homonyms into account. As far as I understand, BERT tackles this problem by taking context into consideration.

What does this mean for the word embedding itself? Is there still one vector for each word token? If so, how is context taken into consideration? If not, what is the format of the embedding?

Solution

Some points first:

  • BERT is a word embedding: BERT is both a word and a sentence embedding. Keep in mind that BERT takes the sequence of words in a sentence into account, which gives you a richer embedding of a word in its context, whereas classic embeddings (yes, after BERT we can call the others "classic"!) mostly deal with a word's neighborhood, i.e. the semantics of a word vector is roughly the average of all the semantics that word had in the training set.
  • it did not take homonyms into account: Either that is a typo on your side or I do not understand "homonym" very well. To be honest, I had to look it up! Google says it means "two or more words having the same spelling or pronunciation but different meanings", like "right" and "write". That is not a problem for word embeddings. Maybe you meant something else?!
  • BERT tackles this problem by taking context into consideration: All embeddings take context into consideration. The difference lies in how context is captured when the sequence of the words is taken into account, e.g. by modeling the whole sentence.

About your questions:

  • Is there still one vector for each word token?: It is one vector per token per layer. That means that to get a single vector for a word, you take n layers and sum their values up. As you know, the further you go toward the later layers (i.e. toward the output layer), the richer the information (features) encoded in them gets. One could also concatenate the layers one after another and get a higher-dimensional representation. Be aware that the choice is completely task-dependent: the BERT authors tried different combinations and ended up summing the last four layers for the NER task. A minimal sketch of this pooling follows right after this list.
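
A minimal sketch of that pooling, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (my choice here; any BERT checkpoint would work the same way): ask the model for all hidden states and sum the last four layers for every token.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    model.eval()

    inputs = tokenizer("I ate an apple for lunch.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # hidden_states is a tuple: the embedding-layer output plus one tensor per
    # transformer layer, each of shape (batch, sequence_length, hidden_size).
    hidden_states = outputs.hidden_states
    print(len(hidden_states))  # 13 for BERT Base: embeddings + 12 layers

    # One vector per token: sum the last four layers (the NER recipe mentioned above).
    token_vectors = torch.stack(hidden_states[-4:]).sum(dim=0)  # (batch, seq_len, hidden_size)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for token, vector in zip(tokens, token_vectors[0]):
        print(token, vector.shape)  # every token, including [CLS] and [SEP], gets its own 768-dim vector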

The interesting thing in comparison with classic embeddings is that you can now encode the sentence-dependent semantics of a token! That means that in word2vec (trained on a general corpus such as Wikipedia) you have one vector for the word apple, and if you inspect it you will probably see that it is related to both iphone and fruit (I never tried it; it is just a made-up example to illustrate my point. Let me know if you try it and something else comes out!). With BERT, however, you can encode sentences that contain the same word in different contexts and compare the encodings of that word in each sentence. You will be surprised how well it captures the semantics! A sketch of such a comparison follows.
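
A minimal sketch of such a comparison, with the same assumptions as above (Hugging Face transformers, bert-base-uncased; the sentences are just made-up examples): take the summed last-four-layers vector of "apple" in a fruit sentence and in a phone sentence and compare them with cosine similarity.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    model.eval()

    def word_vector(sentence, word):
        # Sum of the last four hidden layers for the first occurrence of `word`.
        # Assumes `word` survives WordPiece tokenization as a single token,
        # which holds for "apple" and "orange" in the uncased vocabulary.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden_states = model(**inputs).hidden_states
        token_vectors = torch.stack(hidden_states[-4:]).sum(dim=0)[0]  # (seq_len, hidden_size)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return token_vectors[tokens.index(word)]

    v_fruit = word_vector("I ate an apple and a banana for lunch.", "apple")
    v_brand = word_vector("apple released a new iphone yesterday.", "apple")
    v_orange = word_vector("She peeled an orange and ate it.", "orange")

    cos = torch.nn.CosineSimilarity(dim=0)
    print("apple (fruit) vs apple (brand):", cos(v_fruit, v_brand).item())
    print("apple (fruit) vs orange:       ", cos(v_fruit, v_orange).item())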

Last but not least, this blog post is the basis of my answer.

Hope it helped!

Other tips

When you run BERT, you get one vector per input token, plus one for the special token [CLS] and one for the special token [SEP]. It would be more precise to call these vectors the hidden states of BERT rather than embeddings. The contextual information gets into the embeddings via 12 layers of a self-attentive neural network.

However, tokenization is tricky with BERT: the tokens are not words. It uses so-called WordPieces to represent the input, i.e., less frequent words are split into smaller units, so in the end there are no OOV tokens.

With the BERT Base Cased model, the tokenization looks like this:

  • 'I am the walrus.'
    ['I', 'am', 'the', 'wa', '##l', '##rus', '.']

  • 'What are the elements in a BERT word embedding?'
    ['What', 'are', 'the', 'elements', 'in', 'a', 'B', '##ER', '##T', 'word', 'em', '##bed', '##ding', '?']
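
To reproduce these token lists, here is a minimal sketch assuming the Hugging Face transformers library and the bert-base-cased checkpoint:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # Plain WordPiece tokenization, as in the two examples above.
    print(tokenizer.tokenize("I am the walrus."))
    print(tokenizer.tokenize("What are the elements in a BERT word embedding?"))

    # Encoding the text for the model additionally adds the special tokens:
    # the sequence starts with [CLS] and ends with [SEP].
    ids = tokenizer("I am the walrus.")["input_ids"]
    print(tokenizer.convert_ids_to_tokens(ids))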

When BERT is trained, there are always two sentences in the input, separated by the [SEP] token. The embedding of the [CLS] token is used to predict whether the two sentences follow each other in a coherent text. In sentence classification tasks, the embedding of the [CLS] token is likewise used as the input to the classifier.
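
A minimal sketch of that last point, again assuming Hugging Face transformers with bert-base-cased (the two-class linear head is just a hypothetical, untrained placeholder): a sentence pair is packed as [CLS] sentence A [SEP] sentence B [SEP], and a classifier takes the hidden state at the [CLS] position as input.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased")
    model.eval()

    # Passing two texts yields: [CLS] sentence A [SEP] sentence B [SEP]
    inputs = tokenizer("He opened the door.", "The room was dark.", return_tensors="pt")
    print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

    with torch.no_grad():
        outputs = model(**inputs)

    # The hidden state at position 0 is the [CLS] vector; a classification head
    # (here a hypothetical, untrained 2-class linear layer) takes it as input.
    cls_vector = outputs.last_hidden_state[:, 0]           # (batch, hidden_size)
    classifier = torch.nn.Linear(cls_vector.shape[-1], 2)
    print(classifier(cls_vector).shape)                    # torch.Size([1, 2])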

Licensed under: CC-BY-SA with attribution