I need to compute the cosine similarity between two text documents. I need embeddings that reflect word order, so I don't plan to use document vectors built with bag-of-words or TF-IDF. Ideally I would use pre-trained document embeddings such as doc2vec from Gensim. How can I map new documents onto pre-trained embeddings?

Otherwise, what would be the easiest way to generate document embeddings in Keras/TensorFlow or PyTorch?


Solution

There are several ways you can obtain document embeddings. If you want a vector for a document that was not part of the corpus the doc2vec model was trained on, Gensim provides a method called infer_vector that maps the new document into the model's embedding space.
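For context, here is a minimal sketch of that workflow. It assumes gensim and scipy are installed, and the model file name `my_doc2vec.model` is a hypothetical placeholder for wherever your trained model is saved:

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess
from scipy.spatial.distance import cosine

# Load a previously trained Doc2Vec model (hypothetical file name)
model = Doc2Vec.load("my_doc2vec.model")

doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "A fast brown fox leaps over a sleepy dog."

# infer_vector expects a tokenized list of words, not a raw string
vec_a = model.infer_vector(simple_preprocess(doc_a))
vec_b = model.infer_vector(simple_preprocess(doc_b))

# Cosine similarity is 1 minus the cosine distance
print(1 - cosine(vec_a, vec_b))
```

Note that infer_vector is stochastic (it runs a few training iterations on the new document), so repeated calls can return slightly different vectors; passing a larger epochs argument makes the result more stable.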

You can also use bert-as-service to generate sentence-level embeddings. If your goal is to measure some sort of similarity between sentences or documents, I would recommend Google's Universal Sentence Encoder (USE). There are several ways to combine sentence-level embeddings into a document-level one: the first thing to try is taking the mean of the sentence embeddings, or you could embed a sliding window of sentences over the document and average those instead, as sketched below.
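Here is a minimal sketch of the mean-pooling approach with USE, assuming tensorflow and tensorflow_hub are installed; the module URL is the published USE v4 model, and the naive split on "." stands in for a proper sentence tokenizer:

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (v4) from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def doc_embedding(doc):
    # Naive sentence split; use a real sentence tokenizer in practice
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    # Embed every sentence, then mean-pool into one document vector
    return embed(sentences).numpy().mean(axis=0)

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = "I need to compare two documents. Word order matters to me."
doc2 = "Comparing documents where the sequence of words is important."
print(cosine_sim(doc_embedding(doc1), doc_embedding(doc2)))
```

For the sliding-window variant, you would embed overlapping groups of consecutive sentences instead of single sentences and average those vectors in the same way.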

The reason I recommend USE over BERT is that USE was trained specifically for sentence-similarity tasks, whereas BERT, even though it can be applied to any NLP task, was originally trained to predict masked words in a sentence and whether one sentence follows another. You might find this link helpful: it draws a good comparison between USE and BERT and explains why it is important to choose a model based on the task.

Licensed under: CC-BY-SA with attribution