Question

I have a Latent Dirichlet Allocation (LDA) model with $K$ topics trained on a corpus of $M$ documents. Because of my hyperparameter configuration, the output topic distribution of each document is concentrated on only 3-6 topics, and all the rest are close to zero ($K \sim \mathcal{O}(100)$). By this I mean that, for every document, the contributions of the 3-6 highest-contributing topics are about 6 orders of magnitude greater than the contributions of the remaining topics.

If I use the Jensen-Shannon distance to compute the similarity between documents, I need to store every value of the topic distribution as non-zero, even the very small values of the non-contributing topics, because Jensen-Shannon divides by each discrete value in the distribution. This requires a lot of storage and is inefficient.

If, however, I store the topic distributions of each document as a sparse matrix (the 3-6 highest contributing topics are non-zero and the rest are zero) where each row is a unique document and each column is a topic, then this uses far less space. But I can no longer use the Jensen-Shannon metric, because we would be dividing by 0. In this case:

Can I use the Euclidean distance between documents' topic distributions to compare similarity between documents?

Using the Euclidean distance would require far less storage and is extremely fast to compute.
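
To make the comparison concrete, here is a minimal sketch of that computation. It assumes each document is stored as a dictionary mapping topic index to probability, with only the dominant topics kept. The names `sparse_euclidean`, `doc_a`, `doc_b`, and `doc_c` and the example values are hypothetical and do not come from any LDA library.

```python
import math

def sparse_euclidean(p, q):
    """Euclidean distance between two sparse vectors stored as {topic_index: weight}."""
    # Only topics that are non-zero in at least one document contribute to the sum.
    topics = set(p) | set(q)
    return math.sqrt(sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in topics))

# Hypothetical documents: a handful of dominant topics out of K ~ 100.
doc_a = {3: 0.5, 17: 0.3, 42: 0.2}
doc_b = {3: 0.5, 17: 0.3, 42: 0.2}
doc_c = {5: 0.6, 17: 0.4}

print(sparse_euclidean(doc_a, doc_b))  # identical documents
print(sparse_euclidean(doc_a, doc_c))  # partially overlapping topics
```

The cost per pair depends only on the number of stored topics (3-6 per document here), not on $K$.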

I appreciate that Jensen-Shannon is one of the "correct" metrics for comparing discrete probability distributions, along with the Bhattacharyya distance and the Hellinger distance. But ultimately, the output of LDA is a discrete topic distribution for each document: each document is a vector (or point) in a $K$-dimensional space. By this argument, is it valid to use the Euclidean distance to calculate document similarities? Is there something blatantly wrong with this method?
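
For reference, the Hellinger distance mentioned above is itself a Euclidean distance, taken between the element-wise square roots of the two distributions and scaled by $1/\sqrt{2}$, so it needs no division and can be evaluated on the same sparse storage. The sketch below assumes the same dictionary representation of topic index to probability; the function name `sparse_hellinger` and the example documents are hypothetical, and truncated vectors would sum to slightly less than 1 unless renormalised.

```python
import math

def sparse_hellinger(p, q):
    """Hellinger distance between two sparse distributions stored as {topic_index: prob}."""
    # Missing topics are treated as probability 0, and sqrt(0) = 0 is well defined.
    topics = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2 for t in topics
    )
    return math.sqrt(squared / 2.0)

# Hypothetical documents with disjoint and identical topic supports.
doc_x = {0: 1.0}
doc_y = {1: 1.0}

print(sparse_hellinger(doc_x, doc_x))  # identical distributions
print(sparse_hellinger(doc_x, doc_y))  # no topics in common
```

For properly normalised distributions the value lies between 0 (identical) and 1 (no shared topics).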

I have tested the Euclidean distance for comparing documents, and it yields good results that work well for my industrial application. But I want to know the theoretical justification for such a method. Thanks in advance!


Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange