Solr: Cannot correctly tokenize query terms

https://stackoverflow.com/questions/22622202

20-06-2023

Question

I have the following analyzer setup for both query and index on a particular field type. It should split a term such as "java/cpp" into two tokens, "java" and "cpp", because of a PatternTokenizer I have defined in schema.xml. The split is applied correctly at index time but not at query time, even though the analysis tester in the Solr admin GUI shows the query analyzer producing the expected tokens. Here's the start of the analyzer chain:

<analyzer type="query">

        <!-- Collapse hyphens around alpha words -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([a-zA-Z])(-+)([a-zA-Z])"
                    replacement="$1$3"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([0-9])(,)([0-9])"
                    replacement="$1$3"/>

        <!-- Else Split on hyphens (numbers, etc), plus other splitable chars -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern="([\\*/\),\(\-]|\s)+" />
        ...........
</analyzer>

If I send in a query like:

http://mysolrserver.com/solr/core1/select?q=java%2Fcpp&rows=2&fl=score%2Ctitle%2Ccity&df=title&wt=xml&indent=true&debugQuery=true&defType=edismax&qf=title&tie=1.0

where the q (query) parameter is set to: java/cpp (no quotes), and the query field (qf) is pointing to a field of that fieldtype (title), the debugQuery option shows the following in the response:

<lst name="debug">
  <str name="rawquerystring">java/cpp</str>
  <str name="querystring">java/cpp</str>
  <str name="parsedquery">(+DisjunctionMaxQuery((title:"java cpp")~1.0))/no_coord</str>
  <str name="parsedquery_toString">+(title:"java cpp")~1.0</str>

So java/cpp appears to be sent as a phrase query, even though the query I am sending contains no quotes. The other analyzer transforms from my schema.xml seem to be applied, but the query is not ending up as two separate terms, which is what I expect from the PatternTokenizerFactory defined above. Am I doing something wrong?

Solution

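It turns out I had the autoGeneratePhraseQueries attribute of the fieldType element set to true. With that setting, when the analyzer splits a single unquoted query term into several tokens, the query parser turns those tokens into a phrase query, which is why java/cpp became title:"java cpp". That was not what I wanted. Setting the attribute to false fixed the issue.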
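
For illustration, the fix might look roughly like the fragment below in schema.xml. This is only a sketch: the fieldType name "text_split" is a hypothetical placeholder (the question does not show the real name), and the rest of the analyzer chain is assumed to stay exactly as it was.

```xml
<!-- "text_split" is a hypothetical name; keep your own fieldType name. -->
<!-- The only intended change is autoGeneratePhraseQueries="false". -->
<fieldType name="text_split" class="solr.TextField"
           autoGeneratePhraseQueries="false">
  <analyzer type="query">
    <!-- existing charFilters stay here, unchanged -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="([\\*/\),\(\-]|\s)+" />
    <!-- existing filters stay here, unchanged -->
  </analyzer>
</fieldType>
```

After reloading the core, re-running the same request with debugQuery=true should show whether parsedquery still contains the quoted phrase title:"java cpp" or now lists the tokens as separate terms. Since the attribute affects how queries are parsed, a reindex should not be needed for this change alone, but treat that as an assumption to verify on your own setup.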

Licensed under: CC-BY-SA with attribution
