Question about Latent Dirichlet Allocation (MALLET)

https://stackoverflow.com/questions/4143660

nlp
mallet

30-09-2019
|

Question

Honestly, I'm not familiar with LDA, but am required to use MALLET's topic modeling for one of my projects.

My question is: given a set of documents within a specific timestamp as the training data for the topic model, how appropriate is it to use the model (using the inferencer) to track the topic trends, for documents + or - the training data's timestamp. I mean, is the topic distributions being provided by MALLET a suitable metric to track the popularity of the topics over time if during the model building stage, we only provide a subset of the dataset I am required to analyze.

thanks.

Solution

Are you famailiar with Latent Semantic Indexing? Latent Dirichlet Analysis is just a different way of doing the same kind of thing, so LSI or pLSI you may be an easier starting point to gain knowledge about the goals of LDA.

All three techniques lock on to topics in an unsupervised fashion (you tell it how many topics to look for), and then assume that each document covers each topic in varying proportions. Depending on how many topics you allocate, they may behave more like subfields of whatever your corpus is about, and may not be as specific as the "topics" that people think about when they think about trending topics in the news.

Somehow I suspect that you want to assume that each document represents a particular topic. LSI/pLSI/LDA don't do this -- they model each document as a mixture of topics. That doesn't mean you won't get good results, or that this isn't worth trying, but I suspect (though I don't have a comprehensive knowledge of LSI literature) that you'd be tackling a brand new research problem.

(FWIW, I suspect that using clustering methods like k-Means more readily model the assumption that each document has exactly one topic.)

OTHER TIPS

You should check out the topic-models mailing list at Princeton. They discuss theoretical and practical issues relating to topic models.

I'm aware of three approaches to the tracking the popularity of the topics over time.

It sounds like you might benefit from a dynamic topic modeling approach, which looks at how topics change over time. There's a nice video overview of Blei's work on that here and a bunch of PDFs on his home page. He has a package in C that does it.
A related approach is Alice Oh's topic string approach, where she obtains topics by LDA for texts from time-slices and then uses a topic similarity metric to link topics from different time slices into strings (video, PDF). Looks like MALLET could be part of a topic string analysis, but she doesn't mention how she did the LDA analysis.
The simplest approach might be what David Mimno does in his paper, where he calculates the mean year of a topic from the chronological distribution of the words in the topic. He's involved in the development of MALLET, so it's probably entirely done with that package.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow