Faster way to get distinct values from Lucene Query

https://stackoverflow.com/questions/618227

c#
lucene

03-07-2019

Question

Currently I do it like this:

IndexSearcher searcher = new IndexSearcher(lucenePath);
Hits hits = searcher.Search(query);
Document doc;
List<string> companyNames = new List<string>();

for (int i = 0; i < hits.Length(); i++)
{
    doc = hits.Doc(i);
    companyNames.Add(doc.Get("companyName"));
}
searcher.Close();

companyNames = companyNames.Distinct<string>().Skip(offSet ?? 0).ToList();
return companyNames.Take(count??companyNames.Count()).ToList();

As you can see, I first collect the field value from ALL matching documents (several thousand), then remove the duplicates, possibly skip some, and take a subset of what is left.

I feel like there should be a better way to do this.
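One possible improvement, independent of Lucene itself, is to deduplicate while iterating and stop as soon as the requested page is full. The following is only a sketch: `DistinctPaging` and `DistinctPage` are hypothetical names, and `values` stands in for the sequence of `companyName` values read from the hits.

```csharp
using System;
using System.Collections.Generic;

public static class DistinctPaging
{
    // Returns up to `count` distinct values after skipping the first `offset`
    // distinct values, in first-seen order. Enumeration stops once the page
    // is full, so the remaining input is never read.
    public static List<string> DistinctPage(IEnumerable<string> values, int offset, int count)
    {
        var seen = new HashSet<string>();
        var page = new List<string>();
        if (count <= 0) return page;

        foreach (string value in values)
        {
            // Skip missing values and values already seen (assumption: nulls are not wanted).
            if (value == null || !seen.Add(value)) continue;

            // Still inside the skipped prefix of distinct values.
            if (seen.Count <= offset) continue;

            page.Add(value);
            if (page.Count == count) break;
        }
        return page;
    }
}
```

If `values` is produced lazily (for example by an iterator method that reads one hit at a time), documents beyond the requested page are never loaded. When no `count` is given, every hit still has to be read.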

Solution

I'm not sure there is, honestly, as Lucene doesn't provide 'distinct' functionality. I believe that with Solr you can use a facet search to achieve this, but if you want this in Lucene, you'd have to write some sort of facet functionality yourself. So as long as you don't run into any performance issues, you should be OK doing it this way.
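As a rough illustration of what hand-written "facet functionality" might look like, the sketch below counts how often each value occurs. It does not use any Lucene API: `SimpleFacets` and `CountValues` are hypothetical names, and `values` is assumed to be the field values collected from the hits.

```csharp
using System;
using System.Collections.Generic;

public static class SimpleFacets
{
    // Builds a value -> occurrence count map, sorted by value.
    // The keys are the distinct values; the counts are the facet counts.
    public static SortedDictionary<string, int> CountValues(IEnumerable<string> values)
    {
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);
        foreach (string value in values)
        {
            if (value == null) continue;            // ignore documents without the field
            int current;
            counts.TryGetValue(value, out current); // current stays 0 when the key is absent
            counts[value] = current + 1;
        }
        return counts;
    }
}
```

The keys of the returned dictionary are the distinct values in sorted order, so paging can be applied to `counts.Keys` afterwards.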

OTHER TIPS

Tying this question to an earlier question of yours (re: "Too many clauses"), I think you should definitely be looking at term enumeration from the index reader. Cache the results (I used a sorted dictionary keyed on the field name, with a list of terms as the data, to a max of 100 terms per field) until the index reader becomes invalid and away you go.

Or perhaps I should say that, when faced with a similar problem to yours, that's what I did.

Hope this helps,
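The caching idea described above could be sketched roughly as follows. This is not the answerer's actual code: `TermListCache`, the `loadTerms` delegate (which would wrap the term enumeration), and the `indexVersion` stamp used to detect a stale reader are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public class TermListCache
{
    private const int MaxTermsPerField = 100; // cap mentioned in the answer

    // Sorted dictionary keyed on field name, with the list of terms as the data.
    private readonly SortedDictionary<string, List<string>> cache =
        new SortedDictionary<string, List<string>>();

    // Hypothetical callback that enumerates the terms of a field from the index.
    private readonly Func<string, IEnumerable<string>> loadTerms;

    private long cachedVersion = -1;

    public TermListCache(Func<string, IEnumerable<string>> loadTerms)
    {
        this.loadTerms = loadTerms;
    }

    // `indexVersion` is whatever stamp the caller uses to tell that the
    // index reader has changed; a new stamp invalidates the whole cache.
    public List<string> GetTerms(string fieldName, long indexVersion)
    {
        if (indexVersion != cachedVersion)
        {
            cache.Clear();
            cachedVersion = indexVersion;
        }

        List<string> terms;
        if (!cache.TryGetValue(fieldName, out terms))
        {
            terms = new List<string>();
            foreach (string term in loadTerms(fieldName))
            {
                if (terms.Count == MaxTermsPerField) break;
                terms.Add(term);
            }
            cache[fieldName] = terms;
        }
        return terms;
    }
}
```

Repeated calls with the same stamp are served from memory; the terms are only re-enumerated after the stamp changes.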

I suggest you look for a way to avoid this kind of iteration altogether, but if that is not possible in your context, you can get a performance gain with the following code.
1) At index time, add the field you want to iterate over as the first field of the document:

Document doc = new Document();
Field companyField = new Field(...);
doc.Add(companyField);
...

2) Then define a FieldSelector like this:

class CompanyNameFieldSelector : FieldSelector
{
    // Load companyName and stop reading further fields; skip every other field.
    public FieldSelectorResult Accept(string fieldName)
    {
        return (fieldName == "companyName" ? FieldSelectorResult.LOAD_AND_BREAK : FieldSelectorResult.NO_LOAD);
    }
}

3) Then, when you iterate and want to read only this field, do something like this:

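FieldSelector companySelector = new CompanyNameFieldSelector();
// when you iterate through your hits, pass the selector while loading the document
// (assumes a Lucene.NET version that provides Hits.Id(int) and Searcher.Doc(int, FieldSelector))
doc = searcher.Doc(hits.Id(i), companySelector);
string companyName = doc.Get("companyName");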

This should perform considerably better than the code you provided, because it skips reading the unnecessary document fields, which saves time.

public List<string> GetDistinctTermList(string fieldName)
{
    List<string> list = new List<string>();

    using (IndexReader reader = idxWriter.GetReader())
    {
        // Position the enumerator at the first term of the requested field.
        TermEnum te = reader.Terms(new Term(fieldName));

        if (te != null && te.Term != null && te.Term.Field == fieldName)
        {
            list.Add(te.Term.Text);

            // Terms are ordered by field, so stop at the first term of another field.
            while (te.Next())
            {
                if (te.Term.Field != fieldName)
                    break;
                list.Add(te.Term.Text);
            }
        }
    }

    return list;
}

Licensed under: CC-BY-SA with attribution