I'm not familiar with Gensim (Python), but MALLET could be a solution. If you're comfortable with Java, it shouldn't be too difficult.
Create a cc.mallet.types.InstanceList with your data and fit a cc.mallet.topics.SimpleLDA model. Then, for each cc.mallet.types.Instance (Instances are your documents), compute a divergence metric to every other Instance. For this, you will need the probability of each topic within each Instance, which is slightly tricky. SimpleLDA has an ArrayList<TopicAssignment> data object that holds the Instances and their cc.mallet.topics.TopicAssignment. A TopicAssignment contains a cc.mallet.types.LabelSequence called topicSequence, which holds the topic assignment for each word. You will need to loop through this to get counts for each topic. Then the probability of topic i in document j is simply (#words assigned to topic i in doc j) / (total words in doc j). Store these probabilities and use them to compute the divergence metric of your choice (e.g., KL divergence).
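The counting and divergence steps can be sketched in plain Java roughly as follows. This is only an illustration, not MALLET code: it assumes you have already copied the topic indices from each topicSequence into one int[] per document, and the class name TopicDivergence, its method names, and the smoothing constant alpha are all made up for this example. The smoothing is an addition of mine, because KL divergence becomes infinite when a topic has zero probability in the second document; pass alpha = 0 to get exactly the raw ratio described above.

```java
// Hypothetical helper class; none of these names come from MALLET.
final class TopicDivergence {

    // Topic proportions for one document.
    // assignments[k] is the topic index assigned to the k-th word.
    // alpha is an assumed smoothing constant: with alpha = 0 this is
    // (#words assigned to topic i) / (total words); with alpha > 0 every
    // topic gets a small non-zero probability.
    // Note: an empty document with alpha = 0 yields NaN entries.
    static double[] topicProportions(int[] assignments, int numTopics, double alpha) {
        double[] p = new double[numTopics];
        for (int topic : assignments) {
            p[topic] += 1.0;
        }
        double total = assignments.length + alpha * numTopics;
        for (int i = 0; i < numTopics; i++) {
            p[i] = (p[i] + alpha) / total;
        }
        return p;
    }

    // KL(p || q) in nats. Terms with p[i] == 0 contribute nothing;
    // a zero in q where p is positive gives Infinity.
    static double klDivergence(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                sum += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return sum;
    }

    // Pairwise KL divergences between all documents.
    // result[a][b] = KL(doc a || doc b); the matrix is not symmetric.
    static double[][] pairwiseKl(int[][] docs, int numTopics, double alpha) {
        double[][] probs = new double[docs.length][];
        for (int d = 0; d < docs.length; d++) {
            probs[d] = topicProportions(docs[d], numTopics, alpha);
        }
        double[][] result = new double[docs.length][docs.length];
        for (int a = 0; a < docs.length; a++) {
            for (int b = 0; b < docs.length; b++) {
                result[a][b] = klDivergence(probs[a], probs[b]);
            }
        }
        return result;
    }
}
```

Since KL divergence is asymmetric, result[a][b] and result[b][a] will generally differ. If you need a symmetric measure, you could average the two directions or switch to Jensen-Shannon divergence.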