Question

https://radimrehurek.com/gensim/models/doc2vec.html

For example, if we have trained doc2vec with

"aaaaaAAAAAaaaaaa" - "label 1"

“bbbbbbBBBBBbbbb" - "label 2"

can we infer “aaaaAAAAaaaaAA” is label 1 using Doc2vec?

I know Doc2vec can train word vectors and label vectors. Using these vectors, can we infer which label an unseen sentence (a combination of trained words) belongs to?


Solution

The title of this question asks a separate question from its body, so I will answer both separately (given that one leads into the other).

  1. How can I infer unseen sentences?
# ... assumes a trained model stored in the variable `model`
list_of_words = ["this", "is", "a", "new", "unseen", "sentence"]
inferred_embedding = model.infer_vector(list_of_words)
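
This assumes you already have a trained model. A minimal sketch of how such a model might be trained follows; the toy corpus, tags, and hyperparameters here are purely illustrative:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative toy corpus; in practice use your own tokenized documents
corpus = [
    TaggedDocument(words=["aaaaa", "AAAAA", "aaaaaa"], tags=["label 1"]),
    TaggedDocument(words=["bbbbbb", "BBBBB", "bbbb"], tags=["label 2"]),
]

# `vector_size` is called `size` in gensim versions before 3.4
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)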

How does this work? As per the original paper (https://cs.stanford.edu/~quocle/paragraph_vector.pdf), we have two weight matrices initialized at random: $W \in \mathbb{R}^{N\times P}$, the same matrix as in Word2Vec, in which every column or row maps to a word vector, and $D \in \mathbb{R}^{M \times R}$, the document matrix, where each column or row maps to a sentence/document. During training, a softmax classifier with a fixed window size $k$ slides over each document, maximizing the following average log likelihood (equivalently, minimizing a multi-class cross-entropy):

$$ \frac{1}{M}\sum_{i=1}^{M}\frac{1}{|D_{i}|}\sum_{t=k}^{|D_{i}|-k}\log p\left(w_{t}^{i} \mid w_{t-k}^{i}, \ldots, w_{t+k}^{i}, D_{i} \right) $$

Where $D_{i}$ is the vector representing the $i^{th}$ sentence, $|D_{i}|$ is the number of words in that document, and $w_{t}^{i}$ is the $t^{th}$ word in the $i^{th}$ document. During back-propagation, only the row of $D$ for the document the window is currently moving over is updated, along with the vectors of the words inside that window.
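
For reference, the probability inside the sum is computed with the softmax described in the paper:

$$ p\left(w_{t}^{i} \mid w_{t-k}^{i}, \ldots, w_{t+k}^{i}, D_{i}\right) = \frac{e^{y_{w_{t}}}}{\sum_{j} e^{y_{j}}}, \qquad y = b + U\,h\left(w_{t-k}^{i}, \ldots, w_{t+k}^{i}, D_{i}; W, D\right) $$

where $h$ is built by concatenating or averaging the window's word vectors (rows of $W$) with the document vector $D_{i}$, and $U$ and $b$ are the softmax parameters.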

Furthermore, when we want to infer something not in the training set, we fix $W$ so that it is not updated, augment matrix $D$ with a new randomly initialized row, and train for several iterations (the new row then holds the embedding for the inferred vector). This leads into question 2.
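This inference step is what infer_vector performs. As a sketch, the alpha and epochs arguments shown here follow recent gensim versions (older releases use steps instead of epochs), and the values are illustrative:

# W stays fixed internally; only the new document vector is trained
new_sentence = "aaaa AAAA aaaa AA"
inferred = model.infer_vector(new_sentence.split(), alpha=0.025, epochs=100)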

  2. Can we infer that a possibly unseen sentence exactly corresponds to a sentence in the training set?

The short answer is no, and this is not what Doc2Vec is for. Because of the random initialization plus the complexity of convergence and training, your inferred vector will never be exactly the same as its corresponding vector in $D$; this is why Gensim has no built-in function to support this. However, given that the model has been well trained, these two vectors should be arbitrarily close to each other, so you can conclude that they are extremely similar.

Even fixing the random seed may not work; there are many other variables that can affect convergence. See the first answer on https://github.com/RaRe-Technologies/gensim/issues/374 .

In any case, you can find the most similar label in your data set to an inferred sentence just by iterating over your training set and comparing each document's similarity to the inferred vector. But why would you want an exact match to something in the training set? That is what regular expressions are for; the purpose of these document embeddings is supervised or unsupervised learning tasks (i.e. classification and clustering).

OTHER TIPS

I was trying to solve this problem today and couldn't find any function in gensim's Doc2Vec that calculates the similarity between an inferred document vector (not in the trained model) and the vectors in the trained model. So basically I did this:

from scipy import spatial

inferred_vector = model.infer_vector(sentence.split())
for label in labelled_documents:
    # cosine similarity is 1 minus the cosine distance
    print(1 - spatial.distance.cosine(inferred_vector, model.docvecs[label]))

Based on the Gensim Doc2Vec tutorial, you can also do something like this:

inferred_vector = model.infer_vector(sentence.split())
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

This will give you a list of tuples containing every label and the cosine similarity between your new document's vector and each label's vector. You can then simply take the label with the largest similarity value.
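
For example, since most_similar returns its results sorted by similarity in descending order, the best-matching label is simply the first tuple:

# sims is sorted from most to least similar
best_label, best_similarity = sims[0]
print(best_label, best_similarity)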

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange