Question

In word2vec, why is the likelihood function implemented as a product of the probabilities of finding a neighbouring word given a word? I didn't get why the probabilities should be multiplied. Is there a reason or intuition behind it?

Solution

"probabilities of finding a neighboring word given a word"

Here you are referring to the Skip-Gram architecture, where, given the center word, you predict the surrounding words.

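This extract from these notes might clarify your question. The likelihood is the joint probability of observing all the context words around a center word. Note that by assuming conditional independence, this joint probability factors into a product of the individual probabilities.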
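Written out, the factorization looks roughly like the following. The notation is an assumption made here for illustration and is not taken from the notes: $w_t$ is the center word at position $t$, and $m$ is the window size on each side.

```latex
\begin{align*}
P(w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m} \mid w_t)
  &= \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t)
  && \text{(conditional independence)} \\
\log P(w_{t-m}, \dots, w_{t+m} \mid w_t)
  &= \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t)
  && \text{(taking logs)}
\end{align*}
```

The second line is why training code usually works with a sum of log-probabilities: it is the same objective as the product, only easier to optimize and numerically safer.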

"As in CBOW, we need to generate an objective function for us to evaluate the model. A key difference here is that we invoke a Naive Bayes assumption to break out the probabilities. If you have not seen this before, then simply put, it is a strong (naive) conditional independence assumption. In other words, given the center word, all output words are completely independent."
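To make the independence assumption concrete, here is a minimal Python sketch. Everything in it is hypothetical: the four-word vocabulary, the scores (standing in for the dot products between the center-word vector and each output-word vector), and the choice of context words are made up for illustration and do not come from any word2vec implementation.

```python
import math

# Hypothetical scores for each vocabulary word given one fixed center word.
# In a real model these would be dot products of learned vectors.
scores = {"the": 2.0, "cat": 1.0, "sat": 0.5, "mat": -1.0}


def softmax(raw_scores):
    """Turn raw scores into probabilities P(word | center)."""
    top = max(raw_scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in raw_scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


probs = softmax(scores)

# Hypothetical context words observed around the center word.
context = ["the", "sat"]

# Under the conditional independence assumption, the joint probability of
# the context words is the product of their individual probabilities.
likelihood = math.prod(probs[w] for w in context)

# Equivalent form used in practice: the sum of log-probabilities.
log_likelihood = sum(math.log(probs[w]) for w in context)

print(likelihood, math.exp(log_likelihood))
```

The two printed numbers agree up to floating-point error, which shows that multiplying the probabilities and summing their logs express the same likelihood.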

This article may also help: although it is about negative sampling, it is a very clear exposition.

Licensed under: CC-BY-SA with attribution