Calculating similarity between and centroid of Lucene documents

https://stackoverflow.com/questions/3447175

27-09-2019
|

Question

In order to perform a simple clustering algorithm on results that I get from Lucene, I have to calculate Cosine similarity between 2 documents in Lucene, I also need to be able to make a centroid document to represent the centroid of each cluster.

All I can think of doing is building my own Vector Space model with tf-idf weighting, using the TermFreqVectors and Overall Term frequencies to populate it.

My question is: This is not an efficient approach, is there a better way to do this?

This feels a little unclear so any suggestions on how I can improve my question are also appreciated.

Solution 2

The short answer is: No.

I have spent a lot of time (way way too much) looking into this, and as far as I can see, you can make your own Vector Space Model and work from that, or use Mahout to generate a Mahout Vector, which you can make comparisons between documents from. I am gonna go ahead and make my own, so I'm marking this question answered!

OTHER TIPS

Mark, you may find Integrating Mahout with Lucene, IR Math with Java or Vector Space Classifier Using Lucene useful.

in order to get similarity of one document to the other, why not make a one query with the content of one document and run query against index? that way, you will get score(cosine similarity values)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow