Getting most frequent terms in a subset of indexed lucene documents

https://stackoverflow.com/questions/18837655

28-06-2022

Question

Let's assume the following scenario.

Lucene document: ArticleDocument

Fields: {Id, text, publisherId}
A publisher can publish multiple articles.

Problem

I would like to build word clouds (most frequent words, shingles) for each Publisher Id.

From what I have found, there are ways to get the most frequent terms for the entire index or for a single document, but not for a subset of documents. I found a similar question, but it concerns Lucene 2.x, and I am hoping a more effective approach exists in recent Lucene versions.

Could you please point me to a way to do this in Lucene 4.x (preferred) or 3.x (the latest 3.x release)?

Note that I cannot make each publisher a single document with all of its articles appended to one field, because the words in the cloud must be searchable, with the corresponding articles (by the same publisher id) returned as results.

I'm not sure whether maintaining two types of Lucene documents (article and publisher) is a good idea in terms of maintenance and performance.

Solution

Use pivot faceting, available in Solr 4.x releases (Solr is built on Lucene). Pivot faceting lets you facet within the results of a parent facet.

Generate shingled tokens for the "text" field at indexing time using ShingleFilterFactory.
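To make the shingling step concrete, here is a minimal plain-Java sketch of what word shingles are. It is only an illustration, not Solr's ShingleFilterFactory; the class name ShingleSketch and the method shingles are hypothetical, and the input is assumed to be already tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of word shingling (hypothetical helper, not Solr code).
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive tokens, joined by a space.
    // With minSize = 1 the single words are kept alongside the multi-word shingles.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [most, most frequent, frequent, frequent terms, terms]
        System.out.println(shingles(Arrays.asList("most", "frequent", "terms"), 1, 2));
    }
}
```

In Solr itself the shingles would be produced by the analyzer configured on the "text" field, so each of these strings becomes an indexed term that can later be faceted on.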

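To facet, add the parameters facet=true&facet.pivot=publisherid,text to your query. Field names are case-sensitive, so spell them exactly as they are defined in your schema (the question writes the field as publisherId).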

Sample query:

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true&facet=true&facet.pivot=publisherid,text
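If the query is issued from code, the same URL can be assembled programmatically. The following is a sketch using only the Java standard library; the class name PivotQueryUrlSketch and the method buildUrl are hypothetical, and the endpoint and field names are taken from the sample query above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Builds a Solr select URL with pivot faceting enabled (hypothetical helper).
class PivotQueryUrlSketch {

    // URL-encodes one query-string component as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // pivotFields are joined with commas, parent field first (e.g. publisherid, text).
    static String buildUrl(String selectEndpoint, String... pivotFields) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("q", "*:*");
        params.put("wt", "json");
        params.put("indent", "true");
        params.put("facet", "true");
        params.put("facet.pivot", String.join(",", pivotFields));

        StringBuilder url = new StringBuilder(selectEndpoint).append('?');
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(enc(p.getKey())).append('=').append(enc(p.getValue()));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            buildUrl("http://localhost:8983/solr/collection1/select", "publisherid", "text"));
    }
}
```

The output differs from the hand-written URL only in that ":" and "," are percent-encoded (as %3A and %2C), which a server decodes back to the same parameter values.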

The sample query returns the most frequent shingles/words, with their counts, for each "publisherid".
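As a rough mental model of the result, the following in-memory Java sketch groups articles by publisher and counts terms within each group. It is not how Solr computes facets internally; the class PivotCountSketch, the Article type and the sample data are all hypothetical. It assumes the usual facet semantics, where a count is the number of matching documents containing the term, so a term repeated inside one article is counted once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of "terms per publisher" counting (not Solr code).
class PivotCountSketch {

    // Hypothetical stand-in for an indexed article: its publisher and its analyzed terms.
    static final class Article {
        final String publisherId;
        final List<String> terms;

        Article(String publisherId, List<String> terms) {
            this.publisherId = publisherId;
            this.terms = terms;
        }
    }

    // publisherId -> (term -> number of that publisher's articles containing the term)
    static Map<String, Map<String, Integer>> countByPublisher(List<Article> articles) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Article a : articles) {
            Map<String, Integer> perPublisher =
                counts.computeIfAbsent(a.publisherId, k -> new TreeMap<>());
            for (String term : new HashSet<>(a.terms)) { // each term once per article
                perPublisher.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // The n terms with the highest counts; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Integer> termCounts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(termCounts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<Article> articles = Arrays.asList(
            new Article("p1", Arrays.asList("lucene", "index")),
            new Article("p1", Arrays.asList("lucene", "lucene", "query")),
            new Article("p2", Arrays.asList("solr")));
        Map<String, Map<String, Integer>> counts = countByPublisher(articles);
        System.out.println(counts);
        System.out.println(topTerms(counts.get("p1"), 2));
    }
}
```

The top terms for one publisher are what would feed that publisher's word cloud; clicking a word can then run an ordinary search restricted to the same publisher id, which is the searchability the question asks for.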

Licensed under: CC-BY-SA with attribution
