Topic Modelling in MALLET vs NLTK

https://stackoverflow.com/questions/7476180

nltk
mallet

23-01-2021

Question

I just read a fascinating article about how MALLET could be used for topic modelling, but I couldn't find anything online comparing MALLET to NLTK, which I've already had some experience with.

What are the main differences between them? Is MALLET a more 'complete' resource (e.g. has more tools and algorithms under the hood)? Or where are some good articles answering these first two questions?

Solution

It's not that one is more complete than the other; it's more that each has some things the other doesn't. It's also a question of intended audience and purpose.

Mallet is a Java-based machine learning toolkit that aims to provide robust and fast implementations for various natural language processing tasks.

NLTK is written in Python and ships with a lot of extras, including corpora and lexical resources such as WordNet. NLTK is aimed more at people learning NLP, and as such is used more as a learning platform and perhaps less as an engineering solution.

In my opinion, the main difference between the two is that NLTK is better positioned as a learning resource for people interested in machine learning and NLP, as it comes with a large amount of documentation, examples and corpora.

Mallet is aimed more at researchers and practitioners who work in the field and already know what they want to do. It comes with less documentation (although it has good examples and the API is well documented) compared to NLTK's extensive collection of general NLP material.

UPDATE: Good articles describing these would be the Mallet docs and examples at http://mallet.cs.umass.edu/ (the sidebar has links to sequence tagging, topic modelling, etc.), and for NLTK, the NLTK book, Natural Language Processing with Python, which is a good introduction to both NLTK and NLP.

UPDATE

I've recently found the scikit-learn (sklearn) Python library. It is aimed at machine learning more generally rather than at NLP specifically, but it can be used for NLP as well. It comes with a very large selection of modelling tools, and most of it seems to rely on NumPy, so it should be pretty fast. I've used it quite a bit and can say that it is very well written and documented, and has an active developer community pushing it forward (as of May 2013 at least).

UPDATE 2

I've now also been using Mallet for some time (specifically the Mallet API) and can say that if you're planning on integrating Mallet into another project, you should be very familiar with Java and ready to spend a lot of time debugging an almost completely undocumented code base.

If all you want to do is use the Mallet command line tools, that's fine. Using the API, however, requires a lot of digging through the Mallet code itself, and usually fixing some bugs as well. Be warned: Mallet comes with minimal documentation for the API.
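For the command-line route, here is a rough sketch of how the two usual steps (importing a directory of text files, then training topics) could be assembled from Python. The helper name mallet_commands, the bin/mallet path and the output file names are my own assumptions for illustration; the subcommands and flags are the ones I remember from the Mallet topic modelling docs, so check them against your installed version before relying on this.

```python
import shlex


def mallet_commands(input_dir, work_dir, num_topics=20, mallet_bin="bin/mallet"):
    """Build (but do not run) the two MALLET CLI invocations as argument lists.

    mallet_bin, corpus.mallet, topic_keys.txt and doc_topics.txt are
    placeholder names chosen for this sketch, not anything MALLET requires.
    """
    corpus = f"{work_dir}/corpus.mallet"
    # Step 1: convert a directory of plain-text files into MALLET's binary format.
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", input_dir,
        "--output", corpus,
        "--keep-sequence",      # topic models need word order kept as a sequence
        "--remove-stopwords",
    ]
    # Step 2: train an LDA topic model on the imported corpus.
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", corpus,
        "--num-topics", str(num_topics),
        "--output-topic-keys", f"{work_dir}/topic_keys.txt",
        "--output-doc-topics", f"{work_dir}/doc_topics.txt",
    ]
    return [import_cmd, train_cmd]


if __name__ == "__main__":
    # Print the commands; to actually run them, pass each list to
    # subprocess.run(cmd, check=True) from the MALLET install directory.
    for cmd in mallet_commands("my-texts", "out", num_topics=10):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the commands as lists rather than one shell string avoids quoting problems if your paths contain spaces.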

OTHER TIPS

The question is whether you're working in Python or Java (or neither). Mallet is good for Java (and therefore Clojure and Scala), since you can easily access its API from Java. Mallet also has a nice command-line interface, so you can use it outside of an application.

The same goes for Python: NLTK is great if you're working in Python, and you won't have to do any Jython craziness to get things to play well together. If you're using Python, Gensim just added a Mallet wrapper that is worth checking out. Right now it's basically a bare-bones alpha feature, but it may do what you need.

I'm not familiar with NLTK's topic modeling toolkit, so I won't try to compare it. The Mallet sources on GitHub contain several algorithms (some of which are not available in the 'released' version). To my knowledge, these include:

SimpleLDA (LDA with collapsed Gibbs sampling)
ParallelTopicModel (LDA that works on multi-core)
HierarchicalLDA
LabeledLDA (a semi-supervised approach to LDA)
Pachinko Allocation with LDA
WeightedTopicModel
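To give a feel for what "LDA with collapsed Gibbs sampling" means in the first item, here is a toy pure-Python sampler. This is not Mallet's SimpleLDA code, just a minimal sketch of the general technique; the function name gibbs_lda and the default alpha, beta and iteration values are arbitrary choices of mine.

```python
import random


def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (vocab, doc_topic, topic_word),
    where doc_topic[d][k] and topic_word[k][v] are raw assignment counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional for each topic t:
                # (n_dt + alpha) * (n_tv + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and put the token back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word


if __name__ == "__main__":
    toy_docs = [
        "cat dog cat pet".split(),
        "stock bond market stock".split(),
        "dog pet cat".split(),
    ]
    vocab, doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2)
    for k, counts in enumerate(topic_word):
        top = sorted(zip(counts, vocab), reverse=True)[:3]
        print(k, [word for _, word in top])
```

A real implementation adds hyperparameter optimisation, sparse data structures and, in the case of ParallelTopicModel, multi-threading, which is why you would normally use the toolkit rather than something like this.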

It also has

A couple of classes that help diagnose LDA models (TopicModelDiagnostics.java)
The ability to serialize and de-serialize a trained LDA model

All in all, it is a fine toolkit for experimenting with topic models, with an approachable open-source license (CPL).

Licensed under: CC-BY-SA with attribution
