Question

Is it possible to update the Google News word embeddings with a custom text dataset (text data pertaining to a particular domain)?

The Google News Word2Vec embeddings clearly give us a robust set of word vectors, but unfortunately they cannot be used as-is for many business cases. For example:

embeddings.most_similar('python')

[('pythons', 0.6688377857208252),
 ('Burmese_python', 0.6680365204811096),
 ('snake', 0.6606293320655823),
 ('crocodile', 0.6591362953186035),
 ('boa_constrictor', 0.6443518996238708),
 ('alligator', 0.6421656608581543),
 ('reptile', 0.6387744545936584),
 ('albino_python', 0.6158879995346069),
 ('croc', 0.6083582639694214),
 ('lizard', 0.601341724395752)]

This output is clearly not what we want. We could train a custom word2vec model with the gensim library for this business case, but it would not be exhaustive (the vocabulary would be comparatively smaller). What is the best practice in such cases? Is it possible to update the weights of a pretrained word embedding model so that the embeddings also learn from the domain text data?

No correct solution

Licensed under: CC-BY-SA with attribution