Question

Problem statement: I have about 20k documents. I need to apply topic modelling to find similar documents, and then analyze those similar documents to see how they differ from each other. Q: Could anyone suggest a topic modelling package with which I can achieve this? I am exploring Mallet and Gensim (Python), but I am not sure which would best fit my requirements.

Any help would be highly appreciated.

Solution

I don't know Gensim, but MALLET could be a solution. Assuming you have Java expertise, it shouldn't be too difficult.

Create a cc.mallet.types.InstanceList with your data and fit a cc.mallet.topics.SimpleLDA model. Then, for each cc.mallet.types.Instance (the Instances are your documents), compute a divergence metric to every other Instance. For this, you will need the probability of each topic within each Instance, which is slightly tricky. In SimpleLDA, there is an ArrayList<TopicAssignment> data object that holds the Instances and their cc.mallet.topics.TopicAssignment. A TopicAssignment contains a cc.mallet.types.LabelSequence called topicSequence, which holds the topic assignment for each word. You will need to loop through this to get counts for each topic. Then the probability of topic i in document j is simply (#words assigned to topic i in doc j) / (total words in doc j). Store these probabilities and use them to compute the divergence metric of your choice (e.g., KL divergence). Note that KL divergence is asymmetric and is infinite when a topic has zero probability in the second document but not in the first, so you may want to smooth the counts or use a symmetric measure.
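The arithmetic in the last few sentences can be sketched without MALLET at all. The snippet below is only an illustration in plain Python, not MALLET code: it assumes you have already extracted, for each document, a list of per-word topic ids (the contents of topicSequence). The function names and the alpha smoothing parameter are hypothetical and not part of any library.

```python
import math
from collections import Counter


def topic_proportions(topic_assignments, num_topics, alpha=0.0):
    """Estimate P(topic | document) from per-word topic ids.

    topic_assignments: list of ints, one topic id per word in the document.
    alpha: optional additive smoothing (an assumption of this sketch) so that
           no topic ends up with probability exactly zero.
    """
    counts = Counter(topic_assignments)
    total = len(topic_assignments) + alpha * num_topics
    if total == 0:
        raise ValueError("empty document and no smoothing")
    return [(counts.get(k, 0) + alpha) / total for k in range(num_topics)]


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q is zero where p is not."""
    result = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf
        result += pi * math.log(pi / qi)
    return result


def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, always-finite alternative."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Two toy documents with three topics (made-up assignments for illustration).
doc_a = topic_proportions([0, 0, 1, 2], num_topics=3, alpha=0.01)
doc_b = topic_proportions([0, 1, 1, 1], num_topics=3, alpha=0.01)
print(kl_divergence(doc_a, doc_b), js_divergence(doc_a, doc_b))
```

For 20k documents, comparing every pair means roughly 200 million divergence computations, so in practice you would probably restrict the comparison to candidate pairs (for example, documents sharing a dominant topic).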

OTHER TIPS

Mallet is a very easy tool to explore. Instead of programming against Mallet's Java API, you can run the command-line tool directly; it is available here: http://mallet.cs.umass.edu/download.php. You don't even need to write code to generate files such as the topic distribution of each document. When training topics with the train-topics command, you can specify a file (via the --output-doc-topics option) to which Mallet writes this distribution for you.

After downloading, just type mallet --help to get a list of the things you can do with Mallet. The commands are self-explanatory and very easy to understand.
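As a rough sketch of the command-line workflow described above, the following Python helpers only assemble the two Mallet invocations (importing a directory of text files, then training topics) as argument lists. The helper names, file names and the mallet_bin default are assumptions for illustration; the executable usually lives under bin/ in the Mallet install directory, and mallet train-topics --help remains the authority on the available options.

```python
def import_dir_command(input_dir, output_file, mallet_bin="mallet"):
    """Build the command that converts a directory of text files
    into Mallet's binary format (one document per file)."""
    return [
        mallet_bin, "import-dir",
        "--input", input_dir,
        "--output", output_file,
        "--keep-sequence",      # topic models need word order preserved
        "--remove-stopwords",
    ]


def train_topics_command(input_file, num_topics, doc_topics_file,
                         topic_keys_file, mallet_bin="mallet"):
    """Build the command that trains a topic model and writes the
    per-document topic distribution and the top words per topic."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--output-doc-topics", doc_topics_file,
        "--output-topic-keys", topic_keys_file,
    ]


# Hypothetical paths; to actually run a command you could pass it to
# subprocess.run(cmd, check=True) once Mallet is installed.
import_cmd = import_dir_command("docs/", "docs.mallet")
train_cmd = train_topics_command("docs.mallet", 50,
                                 "doc_topics.txt", "topic_keys.txt")
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

The resulting doc-topics file holds one line per document with its topic proportions; its exact column layout differs between Mallet versions, so check a few lines before parsing it.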

Licensed under: CC-BY-SA with attribution