Question

I'm a new DL4J user, and I'm running all the works of Shakespeare through a Word2Vec neural net. I've got a pretty basic question about how to understand the results so far. In the example below, there's an obvious association between the "ing" in king and the "ing" in other words that probably don't have much to do with king. Am I missing something about how a Word2Vec model uses the characters inside the words it is mapping? Or is my net just really untrained?

Also, what does the cosine distance between those example words say to you about the results, if anything? Thank you for your advice!

    // All vocabulary words that are at least 80% similar to "king"
    List<String> abc = vec.similarWordsInVocabTo("king", 0.8);
    System.out.println(abc);

    // Cosine similarity between "man" and a few hand-picked words
    String[] words = {"woman", "king", "boy", "child", "human"};
    for (String word : words) {
        System.out.println(vec.similarity("man", word));
    }

Output - Similar words to king:

[taking, drinking, kingly, picking, waking, singing, wringing, knight, feigning, beginning, ink, thinking, kin, knocking, making, bringing, knowing, lingring, winking, neighing, king-, kings, asking, stinking, king, liking]

Output - Vector similarity between "man" and woman, king, boy, child, human:

woman:   0.8305895924568176
king:    0.00203840178437531
boy:     0.2974374294281006
child:   0.4752597510814667
human:  -0.10414568334817886

Solution

The Word2Vec algorithm does not look inside words. The word "king" is never used as a gerund, so there is no reason why it should be similar to gerunds.
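A quick sanity check (a sketch, assuming `vec` is the trained DL4J Word2Vec model from the question): each surface form is its own vocabulary entry with its own vector, so "king" and "kingly" share no parameters even though they share characters.

    // Sketch only; `vec` is the trained Word2Vec model from the question.
    // Word2Vec does whole-word lookups and never sees the characters inside a token.
    System.out.println(vec.hasWord("king"));              // one vocabulary entry
    System.out.println(vec.hasWord("kingly"));            // a completely independent entry
    System.out.println(vec.similarity("king", "kingly")); // any similarity comes from shared contexts, not shared letters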

My guesses are:

  • Your corpus might be wrongly tokenized. Perhaps there are OCR-related word-splitting errors, something like "li-↲ king" (see the vocabulary-scan sketch after this list).

  • You might be using a different algorithm for getting the embeddings (e.g., FastText) that does go inside the words and infers the word embedding from the embeddings of the character n-grams the word consists of.
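One rough way to check the tokenization guess (a sketch; it assumes `vec` is the model from the question and that your DL4J version exposes the `vocab().words()` accessor) is to scan the learned vocabulary for hyphen fragments such as the "king-" token that already shows up in your output:

    // Scan the vocabulary for tokens that look like hyphenation/line-break leftovers.
    // Sketch only; `vec` is the trained Word2Vec model from the question.
    for (String token : vec.vocab().words()) {
        if (token.endsWith("-") || token.startsWith("-")) {
            System.out.println("suspicious token: " + token);
        }
    }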

On the other hand, the words similar to man look fine. If you think about how Word2Vec (and also FastText) is trained, the question you should ask is not "Do the words have as similar a meaning as possible?" but rather "Does the word appear with similar frequency in similar contexts in Shakespeare's works?"

(Of course, when the embeddings are trained on a large enough corpus, there is almost no difference between these two questions.)
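For completeness, the numbers printed in the question are just cosine similarities between the learned vectors. A minimal sketch of computing the same value by hand (assuming `vec` is the model from the question and ND4J is on the classpath):

    // Imports needed for the sketch:
    // import org.nd4j.linalg.api.ndarray.INDArray;
    // import org.nd4j.linalg.ops.transforms.Transforms;

    // vec.similarity(a, b) is the cosine of the angle between the two word vectors,
    // i.e. a measure of how similar the contexts the two words were seen in.
    INDArray man   = vec.getWordVectorMatrix("man");
    INDArray woman = vec.getWordVectorMatrix("woman");
    double cosine = Transforms.cosineSim(man, woman); // should match vec.similarity("man", "woman")
    System.out.println(cosine);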

Licensed under: CC-BY-SA with attribution