Question

I am trying to apply word2vec/doc2vec to find similar sentences. First consider word2vec for word similarity. What I understand is that CBOW can be used to find the most suitable word given a context, whereas Skip-gram is used to find the context given some word, so in both cases I am getting words that co-occur frequently. But how does it work to find similar words? My intuition is that, since similar words tend to occur in similar contexts, word similarity is actually measured through the similarity of their contextual/co-occurring words. In the neural net, when the hidden-layer vector representation for some word is passed through to the output layer, it outputs probabilities of co-occurring words. So the co-occurring words influence a word's vector, and since similar words have similar sets of co-occurring words, their vector representations are also similar. To find the similarity, we need to extract the hidden-layer weights (the vectors) for each word and measure their similarity. Do I understand this correctly?
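To make my current understanding concrete, here is a minimal sketch (assuming gensim 4.x; the toy corpus and all names are just placeholders) of extracting the learned word vectors and comparing them with cosine similarity:

```python
# Minimal sketch, assuming gensim 4.x and pre-tokenised sentences.
# The corpus is a toy placeholder, not real data.
from gensim.models import Word2Vec
import numpy as np

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 -> CBOW, sg=1 -> Skip-gram; both learn one vector per word.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# The word vectors are the learned hidden-layer weights;
# similarity between words is cosine similarity between these vectors.
v_cat, v_dog = model.wv["cat"], model.wv["dog"]
cosine = np.dot(v_cat, v_dog) / (np.linalg.norm(v_cat) * np.linalg.norm(v_dog))
print(cosine)

# gensim's built-in lookup performs the same cosine ranking over the vocabulary.
print(model.wv.most_similar("cat", topn=3))
```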

Finally, what is a good way to find tweet text (full sentence) similarity using word2vec/doc2vec?
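For context, these are the kinds of baselines I have in mind (again a hedged sketch assuming gensim 4.x, with made-up tweets): either average the word2vec vectors of a tweet's tokens and compare the averages, or let Doc2Vec learn one vector per tweet and compare those:

```python
# Hedged sketch of two common baselines, assuming gensim 4.x; the tweets are fabricated examples.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

tweets = [
    ["loving", "the", "new", "phone"],
    ["this", "phone", "is", "great"],
    ["traffic", "is", "terrible", "today"],
]

# Baseline 1: average the word2vec vectors of each tweet's tokens
# (using a word2vec model like the one above), then compare with cosine similarity.

# Baseline 2: Doc2Vec learns one vector per tweet directly.
docs = [TaggedDocument(words, [i]) for i, words in enumerate(tweets)]
d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)

a = d2v.infer_vector(["phone", "is", "great"])
b = d2v.infer_vector(["terrible", "traffic"])
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the training tweets against a newly inferred vector.
print(d2v.dv.most_similar([a], topn=2))
```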

No correct solution

Licensed under: CC-BY-SA with attribution