Question

One needs to provide LDA with a predefined number of latent topics. Let's say I have a text corpus in which I hypothesize there are 10 major topics, each composed of 10 minor subtopics. My objective is to be able to define proximity between documents.

1) How do you estimate the number of topics in practice? Empirically? With another method, like the Hierarchical Dirichlet Process (HDP)?

2) Do you build several models, one for the major topics and one for the minor subtopics? Is there a way to capture the hierarchical structure of the topics?


Solution

There are many methods for performing this optimization, namely choosing the number of topics to supply to LDA, and many papers have been written on the subject.

Several notable ones, each of which defines a metric for evaluating the topic quality of LDA models, are:

  • Rajkumar Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy. 2010. On finding the natural number of topics with latent Dirichlet allocation: Some observations. In Advances in knowledge discovery and data mining, Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.). Springer Berlin Heidelberg, 391–402. http://doi.org/10.1007/978-3-642-13657-3_43
  • Cao Juan, Xia Tian, Li Jintao, Zhang Yongdong, and Tang Sheng. 2009. A density-based method for adaptive LDA model selection. Neurocomputing 72, 7–9: 1775–1781. http://doi.org/10.1016/j.neucom.2008.06.011
  • Romain Deveaud, Éric SanJuan, and Patrice Bellot. 2014. Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 17, 1: 61–84. http://doi.org/10.3166/dn.17.1.61-84
  • Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1: 5228–5235. http://doi.org/10.1073/pnas.0307752101

As luck would have it, if you're using R, these metrics have already been compiled for you in a convenient package called ldatuning, which provides utilities for comparing candidate topic counts under each of the four metrics above.

Alternatively, if you're using Python, the gensim package provides many utilities to assist. For example, it implements a metric called "topic coherence", which its authors claim corresponds roughly to how clearly a human can distinguish the topics, among other tuning utilities.
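As a minimal sketch of how this can look in practice (the toy corpus, the candidate range of topic counts, and the choice of the c_v coherence measure are illustrative assumptions, not part of the original answer), one can fit an LDA model for each candidate number of topics and keep the one with the highest coherence:

```python
# Sketch: sweep over candidate topic counts with gensim and score each
# model by c_v topic coherence. The tiny corpus and the range 2..5 are
# toy assumptions; substitute your own preprocessed, tokenized documents.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [
    ["human", "machine", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "trees", "minors", "survey"],
    ["graph", "minors", "trees"],
    ["system", "human", "system", "eps"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

scores = {}
for k in range(2, 6):  # candidate numbers of topics (illustrative range)
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=0, passes=10)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)
print(scores, "-> best number of topics:", best_k)
```

The R ldatuning package follows the same pattern, computing its metrics over a grid of candidate topic counts so you can compare them visually.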

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange