Question

From all the docs I read, people push this way and that on how BERT uses or generates embeddings. I GET that there is a key, a query and a value, and that those are all generated.

What I don't know is whether the original embedding - the original thing you put into BERT - could or should be a vector. People wax poetic about how BERT or ALBERT can't be used for word-to-word comparisons, but nobody says explicitly what BERT is consuming. Is it a vector? If so, is it just a one-hot vector? Why is it not a GloVe vector? (Ignore the positional encoding discussion for now, please.)


Solution

BERT cannot use GloVe embeddings, simply because it uses a different input segmentation. GloVe works with traditional word-like tokens, whereas BERT segments its input into subword units called word-pieces. On the one hand, this ensures there are no out-of-vocabulary tokens; on the other hand, totally unknown words get split into individual characters, which BERT probably cannot make much sense of either.
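You can see the word-piece segmentation yourself with the Hugging Face transformers tokenizer. A minimal sketch; the example words are only illustrative, and the exact splits depend on the bert-base-uncased vocabulary:

# !pip install transformers

from transformers import BertTokenizer

t = BertTokenizer.from_pretrained("bert-base-uncased")

# A frequent word usually stays whole; rarer or made-up words are split
# into word-pieces (marked with ##) or, in the worst case, single characters.
for word in ["dog", "embeddings", "xqzvtrk"]:
    print(word, "->", t.tokenize(word))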

Anyway, BERT learns its custom word-piece embeddings jointly with the entire model. They cannot carry the same kind of semantic information as word2vec or GloVe vectors, because they are often only word fragments, and BERT has to make sense of them in the later layers.

You might say that the inputs are one-hot vectors if you want, but, as almost always, that is just a useful didactic abstraction. All modern deep learning frameworks implement embedding lookup by direct indexing; multiplying the embedding matrix by a one-hot vector would just be wasteful.
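To make that concrete, here is a small sketch in plain PyTorch (not BERT-specific; vocab_size, dim and token_id are made-up toy values) showing that direct indexing returns exactly the row that the one-hot multiplication would select:

import torch

vocab_size, dim = 8, 4
E = torch.randn(vocab_size, dim)      # embedding matrix: one row per vocabulary entry

token_id = 5
one_hot = torch.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ E              # the didactic view: one-hot vector times embedding matrix
via_indexing = E[token_id]            # what frameworks actually do: direct row lookup

print(torch.allclose(via_matmul, via_indexing))  # True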

OTHER TIPS

BERT's input embeddings are the sum of three components:

  • Token embeddings
  • Segment embeddings
  • Position embeddings

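If you want to look at these three embedding tables yourself, here is a quick sketch with the Hugging Face BertModel (attribute names as in the current transformers implementation; the shapes in the comments are what bert-base-uncased reports):

# !pip install transformers

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

print(model.embeddings.word_embeddings)        # Embedding(30522, 768) - token embeddings
print(model.embeddings.position_embeddings)    # Embedding(512, 768)   - position embeddings
print(model.embeddings.token_type_embeddings)  # Embedding(2, 768)     - segment embeddings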


I guess your question is about token embeddings.

For the token embeddings, the tokenizer first maps each word-piece to its vocabulary ID; that ID then selects the corresponding row of the learned token-embedding matrix.

Example:

# !pip install transformers

from transformers import BertTokenizer

t = BertTokenizer.from_pretrained("bert-base-uncased")

text = "My dog is cute."
text2 = "He likes playing"

t1 = t.tokenize(text)
t2 = t.tokenize(text2)
print(t1, t2)

# encode joins the two token sequences, adds the special [CLS] and [SEP] tokens,
# and maps every word-piece to its vocabulary ID
tt = t.encode(t1, t2)
print(tt)

['my', 'dog', 'is', 'cute', '.'] ['he', 'likes', 'playing']

[101, 2026, 3899, 2003, 10140, 1012, 102, 2002, 7777, 2652, 102]
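
Those IDs are not fed to the model as one-hot vectors; each one simply picks out a dense, learned row of the token-embedding matrix (ID 101 is [CLS] and 102 is [SEP] in bert-base-uncased). A sketch continuing the example above:

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

ids = torch.tensor([tt])                          # shape (1, 11): one batch of 11 word-piece IDs
vectors = model.embeddings.word_embeddings(ids)   # each ID indexes a row of the embedding matrix

print(vectors.shape)                              # torch.Size([1, 11, 768])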

Licensed under: CC-BY-SA with attribution