Question

The Dirichlet distribution is used in document modelling.

I read from this article that:

Different Dirichlet distributions can be used to model documents by different authors or documents on different topics.

So how can we tell whether it is modelling different authors or different topics? This is important because, in a document clustering task, it directly dictates the semantics of the clustering result.

I also find it too subjective to limit the possible aspects being modelled to only author or topic. Since there seems to be no strong evidence favoring one specific aspect, it could just as well be any other latent aspect.

Could anyone shed some light on this?


Solution

It sounds like you're making a common mistake when thinking about LDA.

LDA is not a document clustering method. Under the model, assigning a single topic to a document is incorrect; indeed, assigning a single fixed topic to a word is not correct either. Instead, LDA is a way of looking at a collection of documents and at how topics are mixed within those documents. To put it another way, a document does not have a single topic; it has a distribution over topics. This is not uncertainty about which topic the document belongs to, but rather the proportions in which topics are used within that document. Given a document, you can compute the distribution over topic mixtures for that document; given a collection of documents, you can infer both the mixtures within each document and the topics that best describe the collection. Each word also carries uncertainty about which topic it came from, since by definition every topic can emit every possible word, but a given word is more probable under some topics than under others.
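To make the "mixture, not cluster" point concrete, here is a minimal sketch of the standard LDA generative story in plain Python. The two toy topics, their probabilities, the `alpha` values, and the function names are all made up for illustration; this shows how the model assumes documents are produced, not how inference is done.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document under the LDA generative story.

    topics: list of dicts mapping word -> probability (one dict per topic).
    alpha:  Dirichlet concentration parameters, one per topic.
    """
    # Each document gets its own topic proportions, drawn from the Dirichlet.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position according to theta ...
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # ... then pick a word from that topic's distribution over the vocabulary.
        vocab = list(topics[k])
        weights = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words


# Hypothetical toy topics: both can emit every word, with different probabilities.
TOPICS = [
    {"gene": 0.45, "cell": 0.45, "court": 0.05, "law": 0.05},
    {"gene": 0.05, "cell": 0.05, "court": 0.45, "law": 0.45},
]

rng = random.Random(0)
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=20, rng=rng)
print([round(t, 2) for t in theta], words)
```

Note that `theta` is a pair of proportions rather than a label, and that the same word can come from either topic, which is why neither a document nor a word type has one "true" topic under the model.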

To answer your original question about whether the topics reflect author, topic, style, register, or whatever: the topics don't explicitly represent any of these. They represent groupings of words. Each topic is a distribution over the vocabulary, and so different topics represent different tendencies for word use: in a collection of homogeneous authorship but heterogeneous topic, these might correspond to an intuitive notion of "topic" (i.e. subject matter); in a collection of heterogeneous authors but homogeneous topic, perhaps different topics would correlate with different authors. In a collection of mixed topic, author, register, genre, etc. they may not correspond to any observable characteristic at all.

Instead, the topics are an abstract construction, and all the final topics tell you is what the best topics are for allowing you to reconstruct the original input assuming the model is correct. The sad truth is that this might not correspond to what you want the topics to correspond to because the thing you're really interested in (authorship, say) covaries with other things you're not interested in (register, topic, genre) in the collection you provide. Unless you explicitly mark all the things that could be responsible for a shift in usage of vocabulary, as expressed in a bag of words model, and then devise a model which accounts for them all (not vanilla LDA for certain), you simply won't be able to guarantee correspondence between the topics induced and groupings on the dimension you care about.

OTHER TIPS

It is not modeling authors or topics at all, but latent features, which might well map to real-world concepts like author or topic. For any latent feature, you can see which documents are most strongly associated, and maybe develop an intuitive interpretation of what the feature is "about".
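As a rough sketch of that inspection step, assuming you already have per-document topic proportions and a per-topic word distribution from some fitted model: the numbers, document ids, and helper names below are hypothetical, and the code only ranks; interpreting what the feature is "about" is still up to you.

```python
def top_documents(doc_topic, topic, n=3):
    """Return the n document ids with the largest proportion of `topic`."""
    ranked = sorted(doc_topic, key=lambda doc_id: doc_topic[doc_id][topic], reverse=True)
    return ranked[:n]


def top_words(topic_word, n=3):
    """Return the n most probable words under one topic."""
    ranked = sorted(topic_word, key=topic_word.get, reverse=True)
    return ranked[:n]


# Hypothetical output of a fitted model: topic proportions per document ...
DOC_TOPIC = {
    "doc_a": [0.90, 0.10],
    "doc_b": [0.20, 0.80],
    "doc_c": [0.55, 0.45],
}
# ... and the word distribution of topic 0.
TOPIC_WORD = {"gene": 0.45, "cell": 0.40, "court": 0.10, "law": 0.05}

print(top_documents(DOC_TOPIC, topic=0, n=2))  # ['doc_a', 'doc_c']
print(top_words(TOPIC_WORD, n=2))              # ['gene', 'cell']
```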

Licensed under: CC-BY-SA with attribution