Would averaging two vectors in word embeddings make sense?
09-12-2020
Question
I'm currently using the GloVe embedding matrix which is pre-trained on a large corpus. For my purpose it works fine, however, there are a few words which it does not know (for example, the word 'eSignature'). This spoils my results a bit. I do not have the time or data to retrain on a different (more domain-specific) corpus, so I wondered if I could add vectors based on existing vectors. By E(word) I denote the embedding of a word. Would the following work?
E(eSignature) = 1/2 * ( E(electronic) + E(signature) )
If not, what are other ideas that I could use to add just a few words in a word embedding?
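The proposed formula is easy to try out. Below is a minimal sketch with made-up toy vectors (real GloVe vectors are typically 50 to 300 dimensional and would be loaded from the pre-trained file):

```python
import numpy as np

# Toy embedding table; all numbers here are made up for illustration.
E = {
    "electronic": np.array([0.2, 0.8, -0.1]),
    "signature":  np.array([0.5, -0.3, 0.4]),
}

# Proposed out-of-vocabulary vector: the mean of the component words.
E["eSignature"] = 0.5 * (E["electronic"] + E["signature"])
print(E["eSignature"])  # [0.35 0.25 0.15]
```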
Solution
Averaging embedding vectors can make sense if your aim is to represent a sentence or document with a single vector. For out-of-vocabulary words it makes more sense to use a random initialisation and allow the embedding parameters to be trained along with the rest of the model. This way the model learns the representation of the out-of-vocabulary words by itself.
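A minimal sketch of that idea, with a made-up pre-trained matrix (in a framework like Keras or PyTorch you would load the GloVe rows into an embedding layer and leave it trainable, or freeze the known rows and train only the new ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend pre-trained matrix: 3 known words, dimension 4 (made-up numbers).
vocab = ["electronic", "signature", "document"]
emb = rng.normal(scale=0.1, size=(len(vocab), 4))

# Append an out-of-vocabulary word with a small random vector.
# During model training, gradients would update this row so the
# model learns a representation for it; the pre-trained rows can
# optionally stay frozen.
vocab.append("eSignature")
emb = np.vstack([emb, rng.normal(scale=0.1, size=(1, 4))])

print(emb.shape)  # (4, 4)
```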
Alternatively, you could use external resources like WordNet [1] to extract a set of synonyms and other words closely related to a specific term, and then leverage the vectors of those close words (averaging them might make sense, but it's always a matter of testing and seeing what happens; as far as I know there are no established rules yet).
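A rough sketch of the second approach, again with made-up toy vectors. The list of related words is hard-coded here for illustration; in practice you could obtain it from WordNet, e.g. via NLTK's `wordnet.synsets(...)` and their lemmas:

```python
import numpy as np

# Toy vectors (made up); in practice these come from the GloVe matrix.
E = {
    "autograph": np.array([0.4, 0.1]),
    "sign":      np.array([0.2, 0.5]),
    "signature": np.array([0.6, 0.3]),
}

# Hypothetical set of in-vocabulary words related to "eSignature",
# e.g. extracted from WordNet synsets of its component terms.
related = ["autograph", "sign", "signature"]

# Average the vectors of the related words to get the new vector.
oov_vec = np.mean([E[w] for w in related], axis=0)
print(oov_vec)  # [0.4 0.3]
```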