The existing answers are good, but they aren't very beginner-friendly. I want to start from training the LDA model and then calculate the cosine similarity.
Training the model:
docs = ["latent Dirichlet allocation (LDA) is a generative statistical model",
"each document is a mixture of a small number of topics",
"each document may be viewed as a mixture of various topics"]
# Convert document to tokens
docs = [doc.split() for doc in docs]
# Build a mapping from each token to an integer id
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
# Representing the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]
# Training the model
from gensim.models import LdaModel
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
To extract the probability assigned to each topic for a document, there are generally two ways. I show both here:
# Preprocess the test documents the same way as the training documents
test_doc = ["LDA is an example of a topic model",
"topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]
# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))
# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))
output:
#Method 1
0.8279631530869963
#Method 2
0.828066885140262
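As you can see, both methods give nearly the same result. The difference is that the probabilities returned by the 2nd method sometimes don't add up to one, as discussed here. As far as I can tell, that is because model[...] applies the model's default minimum_probability and drops the topics that fall below it, while Method 1 passes minimum_probability=0 explicitly.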
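To see why a probability threshold makes the returned values sum to less than one, here is a minimal sketch in plain Python. It assumes that topics below the threshold are simply dropped and the rest are not renormalised (that is my understanding of what gensim does). The `filter_topics` helper and the topic distribution are made up for illustration; they are not part of gensim.

```python
# Hypothetical helper (not a gensim function): keep only the topics whose
# probability is at least `minimum_probability`, without renormalising.
def filter_topics(topic_dist, minimum_probability):
    return [(topic_id, prob)
            for topic_id, prob in enumerate(topic_dist)
            if prob >= minimum_probability]

# Made-up topic distribution over 10 topics; it sums to 1.
topic_dist = [0.905, 0.05, 0.02, 0.005, 0.005, 0.005, 0.004, 0.003, 0.002, 0.001]

kept_all = filter_topics(topic_dist, 0.0)      # analogous to minimum_probability=0
kept_default = filter_topics(topic_dist, 0.01)  # analogous to a 0.01 threshold

print(len(kept_all), sum(p for _, p in kept_all))
print(len(kept_default), sum(p for _, p in kept_default))
```

With the threshold, only three topics survive and their probabilities no longer add up to one, which also explains why the two cosine similarities differ slightly.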
For a large corpus, the topic probability vectors for all documents can be obtained by passing the whole corpus:
#Method 1
probability_vectors = model.get_document_topics(test_corpus, minimum_probability=0)
#Method 2
probability_vectors = model[test_corpus]
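Once you have one vector per document, you will probably want to compare every pair. The sketch below shows the idea in plain Python so you can see what a cosine similarity over sparse `(topic_id, probability)` pairs amounts to. `sparse_cosine` is a hypothetical helper I wrote for illustration (in practice you would just call `cossim`), and the three vectors are made up rather than taken from a trained model.

```python
import math
from itertools import combinations

# Hypothetical helper: cosine similarity between two sparse vectors,
# each given as a list of (id, weight) pairs.
def sparse_cosine(vec1, vec2):
    d1, d2 = dict(vec1), dict(vec2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty or all-zero vector is similar to nothing
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    return dot / (norm1 * norm2)

# Made-up topic vectors standing in for the per-document results.
vectors = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.6), (1, 0.1), (2, 0.3)],
    [(3, 1.0)],  # shares no topic with the other two
]

# Similarity for every unordered pair of documents, keyed by their indices.
pairwise = {(i, j): sparse_cosine(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}

for pair, sim in pairwise.items():
    print(pair, round(sim, 4))
```

Documents that share no topic get a similarity of 0, and a document compared with itself gets 1 (up to rounding).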
NOTE: Even with minimum_probability=0, the probabilities assigned to the topics of a document may sum to slightly more or slightly less than 1. That is because of floating-point rounding errors.
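Because of that, if you ever need to check that a topic vector sums to one, compare with a tolerance instead of `==`. A small sketch with made-up numbers (ten equal probabilities, not model output):

```python
import math

# Ten made-up topic probabilities of 0.1 each; mathematically they sum to 1.
topic_probs = [0.1] * 10
total = sum(topic_probs)

exact_match = (total == 1.0)                          # fails due to rounding
close_enough = math.isclose(total, 1.0, abs_tol=1e-6)  # tolerant comparison

print(total, exact_match, close_enough)
```

The exact comparison fails even though the numbers "should" add up to one, while the tolerant comparison passes.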