Question

I am trying to learn the topic distribution for each document in a corpus.

I have a term-document matrix (a sparse matrix of shape num_terms x num_docs) as input to the LDA model (with num_topics=100). When I try to infer the topic vector for each document, I get a uniform distribution over the topics. This is highly unlikely, since the documents cover different topics.

The relevant code snippet is:

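# Input: scipy sparse term-document matrix (num_terms x num_docs)
import gensim

corpus = gensim.matutils.Sparse2Corpus(term_doc)  # columns are treated as documents by default
lda = gensim.models.LdaModel(corpus, num_topics=100)
vecs = list(lda[corpus])  # one list of (topic_id, probability) pairs per document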

Now, for each vector in vecs, I get the same probability for every topic.

Can anyone point out where I am going wrong?


Solution

I solved this issue. gensim's LdaModel has a minimum_probability parameter, which is set to 0.01 by default. Topics whose probability for a document is below 0.01 are therefore pruned from the output.
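
To make the pruning concrete, here is a small plain-Python sketch of the effect. It is only an illustration: `prune` is a hypothetical helper, not gensim's code, and the probabilities are made up.

```python
# Hypothetical helper (not part of gensim): drops topics below a threshold,
# similar in spirit to what a minimum-probability cutoff does.
def prune(topic_probs, minimum_probability=0.01):
    return [(k, p) for k, p in enumerate(topic_probs) if p >= minimum_probability]

# A document dominated by two topics out of 100 (illustrative values).
num_topics = 100
probs = [0.002] * num_topics
probs[3] = 0.5
probs[42] = 0.304  # 0.5 + 0.304 + 98 * 0.002 = 1.0

kept_default = prune(probs)    # only topics 3 and 42 survive the 0.01 cutoff
kept_all = prune(probs, 0.0)   # with a cutoff of 0, every topic is kept
```

With 100 topics, most per-document probabilities are likely to be small, so a 0.01 cutoff can hide the majority of them.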

Once I set minimum_probability to a very low value, the output contained every topic with its corresponding probability.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange