Topic Modelling and finding similarity in topics

Question 1

I'm not familiar with Gensim (Python), but MALLET could be a solution. If you are comfortable with Java, it shouldn't be too difficult.

Create a cc.mallet.types.InstanceList from your data and fit a cc.mallet.topics.SimpleLDA model. Then, for each cc.mallet.types.Instance (Instances are your documents), compute a divergence metric against every other Instance.

To do this, you need the probability of each topic within each Instance, which is slightly tricky. SimpleLDA has an ArrayList<TopicAssignment> data object that holds the Instances and their cc.mallet.topics.TopicAssignment. A TopicAssignment contains a cc.mallet.types.LabelSequence called topicSequence, which holds the topic assignment for each word. Loop through it to get a count for each topic. The probability of topic i in document j is then simply (#words assigned to topic i in doc j) / (total words in doc j). Store these probabilities and use them to compute the divergence metric of your choice (e.g., KL divergence).
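The counting and divergence steps can be sketched in plain Java, without any MALLET dependency. This is a minimal illustration, not MALLET's own API: the class and method names (TopicDivergence, topicProportions, klDivergence) are made up for this example, and the per-word topic assignments are assumed to already be available as an int[] per document (in MALLET I believe topicSequence.getFeatures() gives you such an array, but check that against your version). The smoothing parameter is also an assumption added here: KL divergence is infinite when a topic has zero probability in the second document but not in the first, so a small constant is added to every count. With smoothing set to 0 the method reduces to the plain ratio described above.

```java
// Sketch only: plain Java, no MALLET dependency. All names are hypothetical.
class TopicDivergence {

    // Convert one document's per-word topic assignments into topic proportions:
    // (count of topic i + smoothing) / (total words + smoothing * numTopics).
    // With smoothing == 0 this is exactly (#words in topic i) / (total words).
    static double[] topicProportions(int[] topicAssignments, int numTopics, double smoothing) {
        double[] counts = new double[numTopics];
        for (int topic : topicAssignments) {
            counts[topic] += 1.0;
        }
        double total = topicAssignments.length + smoothing * numTopics;
        double[] probs = new double[numTopics];
        for (int i = 0; i < numTopics; i++) {
            probs[i] = (counts[i] + smoothing) / total;
        }
        return probs;
    }

    // KL(p || q) = sum over i of p_i * log(p_i / q_i), using the natural log.
    // Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    static double klDivergence(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                sum += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int numTopics = 3;
        // Toy topic assignments, one entry per word in each document.
        int[] docA = {0, 0, 1, 2};
        int[] docB = {1, 1, 1, 2};
        double[] pA = topicProportions(docA, numTopics, 0.01);
        double[] pB = topicProportions(docB, numTopics, 0.01);
        System.out.println("KL(A || B) = " + klDivergence(pA, pB));
        System.out.println("KL(B || A) = " + klDivergence(pB, pA));
    }
}
```

Note that KL divergence is not symmetric, so KL(A || B) and KL(B || A) generally differ. If you need a symmetric score for ranking similar documents, averaging the two directions or using Jensen-Shannon divergence is a common choice.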

Question 2

MALLET is an easy tool to explore. Instead of calling MALLET's Java API, you can run the command-line binaries directly, which are available here: http://mallet.cs.umass.edu/download.php. You don't even need to write code to generate files such as the topic distribution of each document. When training topics with the train-topics command, you can specify a file for MALLET to write this distribution to (the --output-doc-topics option).

After downloading, just type mallet --help to get a list of the things you can do with MALLET. The commands are self-explanatory and easy to understand.