Question

I've seen posts on performing autocomplete across multiple fields but not on performing autocomplete on multivalued fields.

My autocomplete feature is working for non-multivalued fields.

My problem is that when I run the query on the multivalued field, every document that matches the query contributes all of the values in its multivalued field to the facet results, not just the values that match.

Below is my schema, similar to what is proposed in the Solr 4 Cookbook.

<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="publisherText-str" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="publisherText-ac" type="text_autocomplete" indexed="true" stored="true" required="false" multiValued="true"/>

As you can see publisherText is a multivalued field. I execute a query like this to test the autocomplete feature:

/select?q=publisherText-ac:new&facet=true&facet.field=publisherText-str&facet.mincount=1&rows=0

The query is "new", and this matches a set of documents. However the facet result set contains the other publisherText values (contained in the multivalued field) for each matching document.

Update: When querying "new", the result set should include "New York Times" and "Times New Roman" but does not need to solve the infix problem: "Knewton Gazette" does not need to be in the result set.

Is there a way to have the facet result only contain values that match the query? Or is there a different (better?) way to support the full autocomplete feature that handles multiValued fields more gracefully?

Thanks.


Solution

I think the best approach is to create a separate collection or core (depending on whether you are using SolrCloud) and index your data so that it can be queried for the desired result. This may not be possible in every case, but if it is in yours, go for it. Such a core would contain only the fields and data relevant to autocomplete, so in most cases it will be smaller than the original core and hold fewer terms, which should result in faster queries. In addition, such a core or collection can be optimized for autocomplete queries, which gains you even more performance.

However, if you can't use the multiple cores/collections approach, then highlighting may be the best way to go if you need filtering. In that case you may want to turn on term vectors and use the FastVectorHighlighter to get better highlighting performance (http://solr.pl/en/2011/06/13/solr-3-1-fastvectorhighlighting/).
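As a rough illustration of the highlighting route, the sketch below builds such a request in Python. The helper name `build_highlight_query` and the `rows` default are made up for this example; `hl`, `hl.fl` and `hl.useFastVectorHighlighter` are standard Solr highlighting parameters in the 3.x/4.x line. It assumes the highlighted field is declared with `termVectors`, `termPositions` and `termOffsets` set to `true` in the schema, which the FastVectorHighlighter needs.

```python
from urllib.parse import urlencode


def build_highlight_query(prefix, field="publisherText-ac", rows=10):
    """Build a /select URL that searches `field` and highlights the matches.

    Hypothetical helper for illustration only. Assumes `field` is indexed
    with term vectors, positions and offsets enabled.
    """
    params = [
        ("q", f"{field}:{prefix}"),
        ("rows", rows),
        ("hl", "true"),
        ("hl.fl", field),
        # Ask Solr to use the FastVectorHighlighter instead of the default one.
        ("hl.useFastVectorHighlighter", "true"),
    ]
    return "/select?" + urlencode(params)


url = build_highlight_query("new")
```

The matching values then come back in the `highlighting` section of the response rather than in the facet counts.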

OTHER TIPS

I have used these two ways, so far:

(A) Stick to using facets and accept that you have to reduce the result on the client via a regular expression or String.startsWith. This might actually not be so bad if you use a frontend component like the YUI3 Autocomplete plugin, which already offers this feature without much extra work on your side.

(B) use highlighting by adding to your query:

&hl=true&hl.fl=publisherText-ac

For each hit, the highlighting component will return the matching value, including highlighting tags (by default <em>). This is even more helpful if your autocomplete field is sourced from several input fields and you don't want to search through the results to find out which field contains the matching value. The resulting list may contain duplicates, however.

I am using both approaches: (A) for autocomplete on single fields, (B) when sourcing autocomplete from multiple fields. I tried to get rid of the <em> tags included in the highlighting results, but that proved impossible (you can change them, but not remove them completely).

(Using Solr 4.0 here.)
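Both approaches need a little client-side post-processing, which could look roughly like the Python sketch below. The function names are invented for this example. It assumes the response is requested with `wt=json`, so facet counts arrive as a flat `[value, count, value, count, ...]` list and highlighting as `{doc_id: {field: [snippets]}}`. The word-prefix check mirrors the whitespace + lowercase + edge n-gram analysis from the question, so "new" matches "Times New Roman" but not "Knewton Gazette".

```python
import re


def matches_prefix(value, prefix):
    """True if any whitespace-separated word in `value` starts with `prefix`."""
    prefix = prefix.lower()
    return any(word.startswith(prefix) for word in value.lower().split())


def filter_facet_values(facet_counts, prefix):
    """Approach (A): keep only the facet values that match the typed prefix.

    `facet_counts` is assumed to be the flat list Solr returns for one
    facet field: [value, count, value, count, ...].
    """
    pairs = zip(facet_counts[0::2], facet_counts[1::2])
    return [(value, count) for value, count in pairs if matches_prefix(value, prefix)]


_EM_TAG = re.compile(r"</?em>")


def highlighted_values(highlighting, field):
    """Approach (B): collect highlighted snippets, minus <em> tags and duplicates."""
    seen = set()
    result = []
    for fields in highlighting.values():
        for snippet in fields.get(field, []):
            plain = _EM_TAG.sub("", snippet)
            if plain not in seen:
                seen.add(plain)
                result.append(plain)
    return result
```

Stripping the tags after the fact sidesteps the problem of not being able to remove them on the Solr side, and the `seen` set takes care of the duplicates.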

You can just use the facet.prefix=new parameter and let Solr filter those entries out for you. I would also consider not generating n-grams here at all: faceting on the field and using facet.prefix already does the trick. Hopefully you will not have too many unique terms, and performance will be fine.
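A minimal sketch of such a request, built in Python: the helper name `build_facet_prefix_query` and the match-all `q=*:*` are choices made for this example, while `facet.field`, `facet.prefix` and `facet.mincount` are standard Solr faceting parameters.

```python
from urllib.parse import urlencode


def build_facet_prefix_query(prefix, facet_field="publisherText-str"):
    """Build a /select URL that returns only facet values starting with `prefix`.

    Hypothetical helper for illustration only.
    """
    params = [
        ("q", "*:*"),            # match everything; facet.prefix does the filtering
        ("rows", 0),             # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.prefix", prefix),
        ("facet.mincount", 1),
    ]
    return "/select?" + urlencode(params)


url = build_facet_prefix_query("New")
```

Note that facet.prefix is applied to the indexed terms of the facet field. On a plain string field such as publisherText-str, that makes the match case-sensitive and anchored at the start of the whole value, so "New" would return "New York Times" but not "Times New Roman".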

Licensed under: CC-BY-SA with attribution