How to sort by Lucene.Net field and ignore common stop words such as 'a' and 'the'?

https://stackoverflow.com/questions/66041

09-06-2019

Question

I've found how to sort query results by a given field in a Lucene.Net index instead of by score; all it takes is a field that is indexed but not tokenized. However, what I haven't been able to figure out is how to sort that field while ignoring stop words such as "a" and "the", so that the following book titles, for example, would sort in ascending order like so:

The Cat in the Hat
Horton Hears a Who

Is such a thing possible, and if yes, how?

I'm using Lucene.Net 2.3.1.2.

Solution

I wrap the results returned by Lucene in my own collection of custom objects. That lets me attach extra context to each hit (for example, using the highlighter class to pull out a snippet of the matched text) and add paging. If you take a similar route, you could give your "result" class a SortBy property: grab whichever field you want to sort by, strip out the stop words, and store the result in that property. Then sort the collection on that property instead.
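A minimal sketch of that wrapper idea, written in Python for brevity and without any Lucene calls; the same logic should port directly to C#. The names `SearchResult`, `make_sort_key` and `sort_results`, and the three-word stop list, are made up for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stop-word list; substitute whatever words you want ignored.
STOP_WORDS = frozenset({"a", "an", "the"})


def make_sort_key(text):
    """Lowercase the text, drop punctuation and stop words, rejoin with spaces."""
    words = re.findall(r"\w+", text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


@dataclass
class SearchResult:
    """Hypothetical wrapper around one hit returned by the search."""
    title: str
    # Computed once from the title, so sorting never re-parses it.
    sort_by: str = field(init=False)

    def __post_init__(self):
        self.sort_by = make_sort_key(self.title)


def sort_results(results):
    """Return the wrapped hits ordered by their stop-word-free key."""
    return sorted(results, key=lambda r: r.sort_by)


# The two titles from the question, deliberately in the "wrong" order.
results = [SearchResult("Horton Hears a Who"), SearchResult("The Cat in the Hat")]
ordered = sort_results(results)
for r in ordered:
    print(r.title)
```

Note that this orders only the hits you have already pulled into memory, so with paging you would need to wrap and sort the full result set before slicing out a page.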

OTHER TIPS

When you create your index, create a field that only contains the words you wish to sort on, then when retrieving, sort on that field but display the full title.

It's been a while since I used Lucene, but my guess would be to add an extra field for sorting and store the value in it with the stop words already stripped. You can probably use the same analyzers to generate this value.

There seems to be a catch-22 in that you must tokenize a field with an analyzer in order to strip punctuation and stop words, but you can't sort on tokenized fields. How then to strip the stop words without tokenizing?
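One possible way around this, assuming you control the indexing code, is to do the stripping yourself before the value ever reaches Lucene: split the title into words, drop punctuation and stop words, and join what is left back into a single string. That string can then go into the extra sort field, indexed but not tokenized, while the original title stays in its own field for searching and display. The sketch below is in Python and uses plain dictionaries in place of Lucene documents; the field names `title` and `title_sort`, the helper names, and the stop list are all hypothetical.

```python
import re

# Hypothetical stop-word list; ideally the same words your search analyzer drops.
STOP_WORDS = frozenset({"a", "an", "the"})


def strip_for_sorting(title):
    """Return one normalized string: lowercase, no punctuation, no stop words."""
    words = re.findall(r"\w+", title.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    # Fall back to every word if the title consists only of stop words,
    # so the sort field is never empty for a non-empty title.
    return " ".join(kept or words)


def build_fields(title):
    """Hypothetical field layout for one book; a dict stands in for a document."""
    return {
        "title": title,                          # tokenized and stored: searched and displayed
        "title_sort": strip_for_sorting(title),  # indexed, NOT tokenized: used only for sorting
    }


docs = [build_fields(t) for t in ("Horton Hears a Who", "The Cat in the Hat")]
# Stands in for asking the index to sort on the untokenized field.
docs.sort(key=lambda d: d["title_sort"])
for d in docs:
    print(d["title_sort"], "->", d["title"])
```

Because the sort value is an ordinary string by the time it is indexed, no analyzer has to touch it, which sidesteps the tokenized-versus-sortable conflict.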

I found the article "Search Lucene.Net index with sort option" helpful for this problem.

Licensed under: CC-BY-SA with attribution
