Question

Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) are both topic modeling methods. The major difference is that LDA requires the number of topics to be specified in advance, and HDP doesn't. Why is that so? And what are the differences, pros, and cons of the two methods?


Solution

HDP is an extension of LDA, designed to address the case where the number of mixture components (the number of "topics" in document-modeling terms) is not known a priori. That is the reason for the difference.

Using LDA for document modeling, one treats each "topic" as a distribution over words in a known vocabulary. For each document, a mixture of topics is drawn from a Dirichlet distribution, and then each word in the document is an independent draw from that mixture (that is, first sampling a topic and then sampling a word from that topic).
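
As a minimal sketch of that generative process in Python (the vocabulary size, topic count, and the alpha/beta hyperparameters below are illustrative assumptions, not values from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, doc_len = 1000, 10, 50  # vocabulary size, number of topics, words per document
alpha, beta = 0.1, 0.01       # symmetric Dirichlet hyperparameters (assumed values)

# Each topic is a distribution over the vocabulary.
topics = rng.dirichlet(np.full(V, beta), size=K)  # shape (K, V)

# For one document: draw its topic mixture, then draw each word independently.
theta = rng.dirichlet(np.full(K, alpha))          # the document's topic mixture
doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)       # pick a topic from the mixture
    w = rng.choice(V, p=topics[z])   # pick a word from that topic
    doc.append(w)
```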

For HDP (applied to document modeling), a Dirichlet process is used to capture the uncertainty in the number of topics: a common base distribution is selected that represents the countably infinite set of possible topics for the corpus, and then each document's finite distribution over topics is sampled from that base distribution.
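
One common way to make that "countably infinite set of topics" concrete is the stick-breaking construction; the sketch below is a truncated illustration, with gamma an assumed concentration parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, truncation = 1.0, 100  # assumed concentration; truncation level for the sketch

# Break a unit-length stick: each piece is a Beta(1, gamma) fraction of
# what remains, so weights decay and only finitely many topics get
# appreciable mass; the effective number of topics comes from the data.
betas = rng.beta(1.0, gamma, size=truncation)
remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
weights = betas * remaining  # corpus-level topic weights, summing to ~1

print("topics with weight > 1%:", (weights > 0.01).sum())
```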

As far as pros and cons go, HDP has the advantage that the number of topics need not be bounded in advance and can instead be learned from the data. I suppose, though, that it is more complicated to implement, and unnecessary when a fixed number of topics is acceptable.
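
As a concrete illustration of that trade-off, here is a minimal sketch using gensim's bag-of-words pipeline; the texts variable is a placeholder for your own tokenized documents:

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel, LdaModel

# Placeholder corpus; substitute your own tokenized documents.
texts = [["topic", "modeling", "example"],
         ["another", "toy", "document"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# LDA: the number of topics must be specified up front.
lda = LdaModel(corpus, num_topics=10, id2word=dictionary)

# HDP: no num_topics argument; the model decides how many topics it needs.
hdp = HdpModel(corpus, id2word=dictionary)
print(hdp.print_topics(num_topics=5, num_words=5))
```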

OTHER TIPS

Anecdotally, I've never been impressed with the output from hierarchical LDA. It just doesn't seem to find an optimal level of granularity for the number of topics. I've gotten much better results by running a few iterations of regular LDA, manually inspecting the topics it produced, deciding whether to increase or decrease the number of topics, and iterating until I got the granularity I was looking for.
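
If you want a rough quantitative guide alongside that manual inspection, a sketch of the sweep using gensim's LdaModel and CoherenceModel might look like the following (the toy texts are placeholders, and the candidate k values are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Placeholder corpus; substitute your own tokenized documents.
texts = [["topic", "modeling", "example", "document"],
         ["another", "toy", "document", "example"],
         ["yet", "another", "tiny", "document"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

for k in (5, 10, 20):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, passes=5)
    # A coherence score is only a rough guide; eyeball the topics too.
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print(f"k={k}: coherence={cm.get_coherence():.3f}")
    for topic in lda.print_topics(num_topics=3, num_words=5):
        print("  ", topic)
```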

Remember: hierarchical LDA can't read your mind... it doesn't know what you actually intend to use the topic modeling for. Just like with k-means clustering, you should choose the k that makes the most sense for your use case.

I wanted to point out, since this is one of the top Google hits for this topic, that Latent Dirichlet Allocation (LDA), Hierarchical Dirichlet Processes (HDP), and hierarchical Latent Dirichlet Allocation (hLDA) are all distinct models.

LDA models documents as Dirichlet mixtures of a fixed number of topics (chosen as a parameter of the model by the user), which are in turn Dirichlet mixtures of words. This generates a flat, soft, probabilistic clustering of terms into topics and of documents into topics.

HDP models topics as mixtures of words, much like LDA, but rather than documents being mixtures of a fixed number of topics, the number of topics is itself generated by a Dirichlet process, making it a random variable as well. The "hierarchical" part of the name refers to an extra level in the generative model (the Dirichlet process producing the number of topics), not to a hierarchy among the topics themselves; the topics are still a flat clustering.

hLDA, on the other hand, is an adaptation of LDA that models topics as mixtures of a new, distinct level of topics, drawn from Dirichlet distributions rather than processes. It still treats the number of topics as a hyperparameter, i.e., independent of the data. The difference is that the clustering is now hierarchical: it learns a clustering of the first set of topics, giving more general, abstract relationships between topics (and hence between words and documents). Think of it as the difference between clustering the Stack Exchange sites flatly into math, science, programming, history, etc., and a hierarchy in which Data Science and Cross Validated merge into an abstract statistics-and-programming topic that shares some concepts with, say, Software Engineering, while Software Engineering clusters at a more concrete level with Computer Science, and the similarity among all of these sites only appears at the upper layer of clusters.
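
For the curious, here is a hedged sketch of hLDA using the tomotopy library's HLDAModel; the depth parameter and the toy documents below are illustrative assumptions, not a definitive recipe:

```python
import tomotopy as tp

# Placeholder documents; substitute your own tokenized text.
docs = [["statistics", "regression", "inference"],
        ["python", "programming", "software"],
        ["history", "war", "empire"]]

mdl = tp.HLDAModel(depth=3)  # a three-level topic tree
for words in docs:
    mdl.add_doc(words)
mdl.train(100)

# Each document sits on one root-to-leaf path: upper levels hold the
# more general topics, leaves the more specific ones.
for topic_id in range(mdl.k):
    if mdl.is_live_topic(topic_id):
        print(topic_id, mdl.get_topic_words(topic_id, top_n=3))
```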

I have a situation where HDP works well compared to LDA: about 16,000 documents belonging to various classes. Since I don't know in advance how many topics each class will yield, HDP is really helpful in this case.

Licensed under: CC-BY-SA with attribution