Clustering documents in a Solr index (with custom distance)

https://stackoverflow.com/questions/20974818

25-09-2022
Question

I would like to use k-means clustering (machine learning) to cluster the documents in a Solr/Lucene index. A document typically has a lot of fields: some are text fields and some are locations (latitude and longitude) used for geospatial distance. Solr provides a way to compute the score (distance) between two documents based on specific fields in the index, including geospatial fields (expressed using a Solr query). Is there a way to make use of this "custom distance" in the k-means algorithm?

To elaborate on the "custom distance" a bit: typically, given a value X for "dimension 1" in one document and a similar numerical value for the same "dimension 1" in another document, we compute the Euclidean distance.

But in this Solr use case, the distance between documents is obtained on the fly from the Solr relevancy score for a given set of documents. This amounts to a custom distance. Is there any tool or approach that could help here?

Can I use R, Mahout, or Octave for this?

I understand that we can export the term vectors from Solr and use Mahout for this, but that requires an export and also repeating in Mahout the same scoring work that Solr already does. The geospatial support and the elegance of expressing a distance as a Solr query are also lost.

Edit: Solr's Carrot2 clustering doesn't seem to cut it, as it is more optimized for search results (<1K results).

Solution

You can use any library, or a self-implemented k-means, to do the clustering based on the given similarity score.
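As a rough illustration of the self-implemented route, here is a minimal sketch in plain Python. Note that it swaps in k-medoids rather than textbook k-means: classic k-means averages coordinates to build centroids, which is not possible when all you have is a pairwise score, whereas k-medoids only ever picks an existing document as the cluster center. The sketch assumes the pairwise scores have already been fetched from Solr into a symmetric matrix where a higher score means more similar; the function names (similarity_to_distance, farthest_first, k_medoids), the scaling by the maximum score, and the farthest-first initialisation are all illustrative choices, not anything Solr provides.

```python
def similarity_to_distance(sim):
    """Turn a symmetric similarity matrix into distances in [0, 1].

    Assumption: higher score means more similar. Scores are scaled by the
    largest value, and every item is at distance 0 from itself.
    """
    top = max(max(row) for row in sim)
    n = len(sim)
    return [[0.0 if i == j else 1.0 - sim[i][j] / top for j in range(n)]
            for i in range(n)]


def farthest_first(dist, k):
    """Deterministic start: begin at item 0, then keep adding the item that
    is farthest from its nearest already-chosen medoid."""
    medoids = [0]
    while len(medoids) < k:
        nxt = max(range(len(dist)),
                  key=lambda i: min(dist[i][m] for m in medoids))
        medoids.append(nxt)
    return medoids


def k_medoids(dist, k, max_iter=100):
    """Cluster items using only a pairwise distance matrix.

    Returns (medoids, labels); labels[i] is the position in `medoids`
    of the medoid nearest to item i. Assumes k <= number of items.
    """
    n = len(dist)

    def assign(meds):
        # Attach each item to its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][meds[c]])
                for i in range(n)]

    medoids = farthest_first(dist, k)
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if its cluster ended up empty.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy scores for five documents (made-up numbers, not real Solr output).
scores = [
    [10.0, 9.0, 9.0, 1.0, 1.0],
    [9.0, 10.0, 9.0, 1.0, 1.0],
    [9.0, 9.0, 10.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 10.0, 9.0],
    [1.0, 1.0, 1.0, 9.0, 10.0],
]
medoids, labels = k_medoids(similarity_to_distance(scores), k=2)
print(medoids, labels)
```

On this toy matrix the first three documents land in one cluster and the last two in the other. In practice the expensive part is filling the matrix, since it needs a score for every pair of documents, and Solr relevancy scores are not guaranteed to be symmetric or comparable across queries, so some normalisation along these lines would likely be needed first.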

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow