In doc2vec, how to model correctly when many documents share the same label?

https://datascience.stackexchange.com/questions/19107

22-10-2019

Question

In every doc2vec training example I've found, each document has a unique label. What happens when many documents share the same label?

gensim's TaggedDocument accepts a list of labels for the same text, which implies that a single text can carry multiple labels. However, it's not clear to me whether it's good practice to put fragmented texts under the same label. You can still train and get embeddings, but are they any good?

For example, the question I am posting here has a title, a detailed description, and a list of tags. How do I model it for doc2vec to find similar questions?

Note that some of the tags appear in neither the title nor the description. What's the best way to include them in doc2vec training? Shuffle them and concatenate them with the title and description? Or keep them as separate entries under the same label?

Solution

I've tried to explain the logic behind the labels used for document vectors in "Doc2Vec - How to label the paragraphs (gensim)".

To answer your questions:

1) When two documents share the same label, the doc2vec algorithm learns the semantic meaning of that label from both documents. Note that doc2vec learns the semantic meaning of labels, not of individual documents.

2) Again, you are not learning documents. You are instructing doc2vec to learn embeddings for the labels. So, if a document is given multiple labels, all of them receive the same semantic meaning from that document, and any label that is also used on other documents keeps picking up more semantic meaning from those. For instance, suppose doc1 carries the labels hunt, bite, eat, flesh and doc2 carries the labels life, love, eat, money. It is clear that doc1 is about animals and doc2 is about humans, so the label eat will take semantic meaning from both of them.

3) If your goal is to find similar questions such as this one, then you should probably give just a single label to the whole question, and then look for the question whose label vector is closest in cosine distance (that is, has the highest cosine similarity).

4) Never confuse labels with words. In word2vec, words learn embeddings; in doc2vec, labels learn embeddings from the words used in their documents. If you would like to add more semantic meaning to a document, you can add it to that particular document as words. If you want to add semantic meaning to a label and give it a strong affinity, the words have to be added to every document that carries that label (though adding words manually doesn't sound like a good option to me personally).
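The points above can be sketched in a few lines of plain Python. This is only a bookkeeping illustration, not doc2vec itself: it shows which words each label would be trained against, and one way to fold a question's title, description, and tags into a single document under one unique label. `TaggedDoc` is a stand-in that mirrors the `words` and `tags` fields of gensim's `TaggedDocument`; the helper names, the toy sentences, the question id, and the tag values are all assumptions made up for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags),
# so the sketch runs without gensim installed.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def words_seen_by_label(corpus):
    """For each label, collect every word of every document carrying it."""
    seen = defaultdict(set)
    for doc in corpus:
        for label in doc.tags:
            seen[label].update(doc.words)
    return dict(seen)


def question_to_doc(question_id, title, description, site_tags):
    """Fold title, description and site tags into one word list under one unique label.

    The site tags are added as words (point 4), not as extra labels.
    """
    words = (title + " " + description).lower().split()
    words += [tag.lower() for tag in site_tags]
    return TaggedDoc(words=words, tags=[question_id])


# Point 2: two documents whose label lists overlap on "eat".
# The sentences are hypothetical; only the labels come from the answer.
corpus = [
    TaggedDoc(words="the lion chased the zebra".split(),
              tags=["hunt", "bite", "eat", "flesh"]),
    TaggedDoc(words="she paid for dinner with friends".split(),
              tags=["life", "love", "eat", "money"]),
]
seen = words_seen_by_label(corpus)
print(sorted(seen["eat"]))   # words from both documents
print(sorted(seen["hunt"]))  # words from the first document only

# Point 3: one unique label per question, tags folded in as words.
question = question_to_doc(
    "q19107",                                  # hypothetical id
    "doc2vec shared labels",                   # hypothetical title text
    "many documents share the same label",     # hypothetical description text
    ["gensim", "word2vec"],                    # hypothetical site tags
)
print(question.tags, question.words)
```

With gensim, the same `(words, tags)` pairs would be passed as `TaggedDocument(words=..., tags=...)` objects to the model. Keeping one unique label per question means each learned vector describes exactly one question, which is what a nearest-neighbour search over label vectors needs.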

Other tips

For the example you pose, the answers and discussion on this question might be helpful - some good points on options around labelling:

Doc2Vec - How to label the paragraphs (gensim)

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange