Question

I am working on a problem where the text uses the English alphabet, but the language is not English. It is a mixture of English and another language, yet all words are written with English letters. Word-based pre-trained embedding models will not work here, because they assign a random embedding to out-of-vocabulary words.

My question is: how do context-based pre-trained embeddings deal with out-of-vocabulary words?

Besides, what's the difference between context-based embeddings and character-based embeddings?


Solution

Context-based or contextual means that the vector contains information about the use of the word in the context of a sentence (or, more rarely, a document). It thus does not make sense to talk about such word embeddings outside the context of a sentence.
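To make "contextual" concrete, here is a minimal sketch, assuming the Hugging Face transformers library and PyTorch (neither is prescribed by the answer itself). It extracts the vector of the word "bank" from two different sentences and compares them; the similarity is noticeably below 1, because each occurrence is encoded together with its surrounding sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Contextual vector of `word` in `sentence` (assumes `word` is a single subword)."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_river = word_vector("He sat on the bank of the river.", "bank")
v_money = word_vector("She deposited the cash at the bank.", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))     # same word, different vectors
```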

Models such as BERT segment the input into so-called subwords, which is basically a statistical heuristic. Frequent words are kept intact, whereas infrequent words get segmented into smaller units (which often resemble stems or morphemes, and often seem fairly arbitrary), ultimately falling back to single characters for the rarest strings (typically rare proper names). As a result, you get contextual vectors of subwords rather than words.
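For illustration, this is roughly what such segmentation looks like with a BERT tokenizer (a sketch assuming the transformers library; the exact pieces depend on the model's vocabulary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("language"))
# ['language']                                      -- a frequent word is kept intact
print(tok.tokenize("embeddings"))
# something like ['em', '##bed', '##ding', '##s']   -- an infrequent word is split into subwords
```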

Character-based embeddings usually mean word-level embeddings inferred from character input. For instance, ELMo used character-level inputs to get word embeddings that were further contextualized using a bi-directional LSTM. ELMo embeddings are thus both character-based and contextual.
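As a rough illustration of the idea (not ELMo's actual architecture, which runs a character CNN and highway layers before the bi-directional LSTM over words), a word vector can be built from its characters like this, assuming PyTorch:

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Toy character-based word encoder: embed characters, run a BiLSTM over them,
    and concatenate the final hidden states into a word vector. Any string of
    characters gets a vector, so there are no OOV *words* by construction."""
    def __init__(self, charset="abcdefghijklmnopqrstuvwxyz'-", char_dim=16, word_dim=64):
        super().__init__()
        self.char2id = {c: i + 1 for i, c in enumerate(charset)}   # 0 = unknown character
        self.char_emb = nn.Embedding(len(charset) + 1, char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, word):
        ids = torch.tensor([[self.char2id.get(c, 0) for c in word.lower()]])
        chars = self.char_emb(ids)                  # (1, len(word), char_dim)
        _, (h, _) = self.lstm(chars)                # h: (2, 1, word_dim // 2)
        return torch.cat([h[0, 0], h[1, 0]])        # (word_dim,)

enc = CharWordEncoder()
print(enc("zindagi").shape)   # torch.Size([64]), even for a word never seen in training
```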

Both with subwords and with embeddings derived from characters, there are technically no OOV words. With subwords, the input can always be broken down into characters (and all characters are in the vocabulary). With character-level methods, you always get a vector computed from the characters. There is of course no guarantee that an unseen word is processed reasonably, but in most cases it is.
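Tying this back to your setting: a transliterated, Latin-script word from code-mixed text still decomposes into known subword pieces rather than an unknown token. A quick check, assuming the transformers library and a multilingual BERT vocabulary (the word "zindagi" here is just an arbitrary example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

pieces = tok.tokenize("zindagi")   # a transliterated word the tokenizer has likely never seen whole
print(pieces)                      # a list of subword pieces, e.g. ['z', '##ind', '##agi']
print(tok.unk_token in pieces)     # False -- the word never collapses into an unknown token
```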

Models that use static word embeddings (such as the Universal Sentence Encoder) typically reserve a special token (usually <unk>) for all unknown words, so the model is not surprised by a random vector at inference time. If you limit the vocabulary size in advance, <unk> tokens will naturally occur already in the training data.
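A minimal sketch of that vocabulary-capping step (the helper name and tiny corpus are hypothetical, not tied to any particular library):

```python
from collections import Counter

def build_vocab(corpus_tokens, max_size=50_000):
    """Keep the most frequent words; everything else maps to <unk>."""
    counts = Counter(corpus_tokens)
    vocab = {"<unk>": 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

corpus = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(corpus, max_size=4)                        # tiny cap to force some <unk>s
ids = [vocab.get(w, vocab["<unk>"]) for w in "the dog sat".split()]
print(ids)   # "dog" (and any other unseen or rare word) falls back to the <unk> id
```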

Licensed under: CC-BY-SA with attribution