Question

We are having trouble searching for parts of sentences with Solr. We tried different queries against the documents below:

<doc>
    <str name="id">7975</str>
    <str name="name">Ici Paris XL geschenkset aanbieding</str>
    <long name="_version_">1467524712314699776</long>
</doc>
<doc>
    <str name="id">7976</str>
    <str name="name">De tuinen geschenkset aanbieding</str>
    <long name="_version_">1467524712315748352</long>
</doc>

Searches we tried:

  • *:* returns everything
  • *Paris* returns just the first one (the one with Paris, which is correct)
  • *Paris*XL* returns just the first one
  • *paris*XL* returns nothing (capital P is now lowercase)
  • *(Paris XL)* returns everything
  • *"Paris XL"* returns everything
  • (Paris XL) returns nothing
  • "Paris XL" returns nothing

What we want is to be able to search for "Paris XL" (with the space) and get the first result back, so that we can search on parts of sentences. Is that possible? And how?


Solution

What you are missing is a fundamental part of how Lucene, and hence Solr, performs its searches. Solr does not scan the documents themselves; it looks up the tokens (words) of a query in an index that has been built for a set of documents. Such an index is, very simplified, like the index at the end of a book: there you can look up where each word occurs in the actual text.

Currently your field name is not tokenized at all, because it uses the fieldType string. This means that the whole content of the field is indexed as a single token. That is a problem, as you have already noticed.

When you search for paris, Solr looks into the index and checks whether the token paris exists there. It does not. Taking the two examples you posted, your index consists of two entries:

  1. Ici Paris XL geschenkset aanbieding
  2. De tuinen geschenkset aanbieding

To produce a hit, one of those entries would have to equal paris exactly, from the first character to the last. Neither does. So you surrounded paris with the wildcard *. That makes Solr inspect every entry of the index, which results in very bad performance. A string field is also stored verbatim, so matching is case-sensitive, which is why *paris*XL* returns nothing.
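
To make the difference concrete, here is a minimal Python sketch of the idea. It is not how Lucene is implemented; it only assumes that each field value is stored as one term, and it uses Python's fnmatchcase as a stand-in for Solr's wildcard matching. The names exact_lookup and wildcard_scan are made up for this illustration.

```python
from fnmatch import fnmatchcase

# With fieldType "string", each field value is kept as one single term.
terms = [
    "Ici Paris XL geschenkset aanbieding",
    "De tuinen geschenkset aanbieding",
]

def exact_lookup(query):
    """A plain term query only hits if the query equals the whole term."""
    return [t for t in terms if t == query]

def wildcard_scan(pattern):
    """A query with a leading wildcard has to test every term (case-sensitive)."""
    return [t for t in terms if fnmatchcase(t, pattern)]

print(exact_lookup("Paris"))        # []
print(wildcard_scan("*Paris*"))     # ['Ici Paris XL geschenkset aanbieding']
print(wildcard_scan("*paris*XL*"))  # []
```

The wildcard version only finds the document because it walks through all terms, which is the part that does not scale.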

So what should you do to tackle this? Start tokenizing! This is described rather well in the official documentation and in the Solr Tutorial.

In the end you will add a custom fieldType to your schema.xml, similar to this one:

<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Then you need to change the fieldType of your field name to that new fieldType and rebuild your index.

Your index will then contain more entries for the field name: not just the two from above, but these:

  1. ici
  2. paris
  3. xl
  4. geschenkset
  5. aanbieding
  6. de
  7. tuinen
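
As a rough illustration of how those entries come about, the following Python sketch builds a tiny inverted index from the two example documents. The analyze function is only an approximation of StandardTokenizerFactory plus LowerCaseFilterFactory (it splits on non-word characters and lowercases), and the variable names are assumptions made for this example.

```python
import re
from collections import defaultdict

docs = {
    "7975": "Ici Paris XL geschenkset aanbieding",
    "7976": "De tuinen geschenkset aanbieding",
}

def analyze(text):
    # Rough stand-in for StandardTokenizerFactory + LowerCaseFilterFactory.
    return [tok.lower() for tok in re.findall(r"\w+", text)]

# Inverted index: token -> set of document ids containing that token.
index = defaultdict(set)
for doc_id, name in docs.items():
    for token in analyze(name):
        index[token].add(doc_id)

for term in sorted(index):
    print(term, sorted(index[term]))
```

Looking up paris is now a direct lookup of one index entry instead of a scan over all of them.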

Then you can query for "paris xl". Note the quotation marks around the search terms. They make Solr perform a phrase query, so it will produce a hit only if paris is directly followed by xl. Because the analyzer above is applied to the query as well, "Paris XL" is lowercased and matches too.
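
A phrase query relies on the index also recording the position of each token. The sketch below shows the idea in plain Python under the same simplifying assumptions (a regex tokenizer plus lowercasing in place of the real analyzer); phrase_query and positions are hypothetical names, not Solr APIs.

```python
import re
from collections import defaultdict

docs = {
    "7975": "Ici Paris XL geschenkset aanbieding",
    "7976": "De tuinen geschenkset aanbieding",
}

def analyze(text):
    # Rough stand-in for StandardTokenizerFactory + LowerCaseFilterFactory.
    return [tok.lower() for tok in re.findall(r"\w+", text)]

# Positional index: token -> {doc_id: [positions of the token in that doc]}.
positions = defaultdict(dict)
for doc_id, name in docs.items():
    for pos, token in enumerate(analyze(name)):
        positions[token].setdefault(doc_id, []).append(pos)

def phrase_query(phrase):
    """Return ids of documents in which the phrase tokens occur consecutively."""
    terms = analyze(phrase)
    if not terms:
        return []
    hits = []
    for doc_id, starts in positions.get(terms[0], {}).items():
        for start in starts:
            # Every following token must sit exactly i positions after the first.
            if all(start + i in positions.get(t, {}).get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

print(phrase_query("Paris XL"))  # ['7975']
print(phrase_query("XL Paris"))  # []
```

The second call returns nothing because both tokens exist in document 7975 but not in that order.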

Licensed under: CC-BY-SA with attribution