Question

I have some repeated (identical strings) data in a multiValued field in my Solr index, and I want to boost documents by the number of matches in that field. For example:

doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }

When I run the query q=locales:en_US I would like to see doc1 at the top, because it has two "en_US" values. What is the proper way to boost this kind of data?

Should I use a special tokenizer?

The Solr version is 4.5.

No correct solution

OTHER TIPS

Disclaimer

In order to use either of the following solutions, you will need to make one of the following changes:

  • Create a copyField for locales:
<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store locales_text (stored="false"), as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>
  • Change the type of locales to "text_general" (this type is provided in the standard Solr collection1 schema); an example field definition is shown after the list.
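
If you go with the second option, the changed field definition could look something like this (a sketch; the other attributes are carried over from the original definition above):

<field name="locales" type="text_general" indexed="true" stored="true" multiValued="true"/>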

First solution (Ordering):

Results can be ordered by a function, so we can order by the number of occurrences in the field (the termfreq function); a full example request is shown after the list:

  • If copyField is used, then sort query will be: termfreq(locales_text,'en_US') DESC

  • If locales is of text_general type, then sort query will be: termfreq(locales,'en_US') DESC
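
For example, assuming a core named collection1 running at http://localhost:8983/solr (adjust the host and core name to your setup), the full request for the copyField variant could look like this (URL-encode the parameter values as needed):

http://localhost:8983/solr/collection1/select?q=locales:en_US&fl=*,score&sort=termfreq(locales_text,'en_US') DESC&wt=xml&indent=true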

Example response for copyField option (the result is the same for text_general type):

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="fl">*,score</str>
    <str name="sort">termfreq(locales_text,'en_US') DESC</str>
    <str name="indent">true</str>
    <str name="q">locales:en_US</str>
    <str name="_">1383598933337</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
  <doc>
    <arr name="locales">
      <str>en_US</str>
      <str>de_DE</str>
      <str>fr_FR</str>
      <str>en_US</str>
    </arr>
    <str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
    <long name="_version_">1450808563062538240</long>
    <float name="score">0.4203996</float></doc>
  <doc>
    <arr name="locales">
      <str>en_US</str>
    </arr>
    <str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
    <long name="_version_">1450808391856291840</long>
    <float name="score">0.5945348</float></doc>
</result>
</response>

You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.
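
For example, the term frequency can be returned as a pseudo-field (the matches alias here is just an illustrative name):

fl=*,score,matches:termfreq(locales_text,'en_US')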

One thing to keep in mind: this is an ordering function, not a boost function. If you would rather boost the score based on multiple matches, you will probably be more interested in the second solution.

I included the score in the results to demonstrate what @arun was talking about. You can see that the scores differ (probably due to the field length)... Quite unexpected (for me) that this also holds for a multivalued string field.

Second solution (Boosting):

  • If copyField is used, then the query (shown as a full request after the list) will be: {!boost b=termfreq(locales_text,'en_US')}locales:en_US

  • If locales is of text_general type, then the query will be: {!boost b=termfreq(locales,'en_US')}locales:en_US
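
Again assuming a core named collection1 at http://localhost:8983/solr, the full request for the copyField variant could look like this (note defType=edismax, matching the parameters echoed in the response below; URL-encode as needed):

http://localhost:8983/solr/collection1/select?q={!boost b=termfreq(locales_text,'en_US')}locales:en_US&fl=*,score&defType=edismax&wt=xml&indent=true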

Example response for copyField option (the result is the same for text_general type):

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="lowercaseOperators">true</str>
    <str name="fl">*,score</str>
    <str name="indent">true</str>
    <str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
    <str name="_">1383599910386</str>
    <str name="stopwords">true</str>
    <str name="wt">xml</str>
    <str name="defType">edismax</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
  <doc>
    <arr name="locales">
      <str>en_US</str>
      <str>de_DE</str>
      <str>fr_FR</str>
      <str>en_US</str>
    </arr>
    <str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
    <long name="_version_">1450808563062538240</long>
    <float name="score">1.1890696</float></doc>
  <doc>
    <arr name="locales">
      <str>en_US</str>
    </arr>
    <str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
    <long name="_version_">1450808391856291840</long>
    <float name="score">0.5945348</float></doc>
</result>
</response>

You can see that the score changed significantly. The first document scores twice as much as the second (because there were two matches, each scored as 0.5945348).

Third solution (omitNorms=true)

Based on the answer from @arun I figured that there is also a third option.

If you convert your field to (for example) text_general AND set omitNorms=true for that field, it should have the same result.

The default standard request handler in Solr does not use only the term frequency to compute the scores. Along with term frequency, it also uses the length of the field. See the Lucene scoring algorithm, where it says:

lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.

Since doc2 has a shorter field, it might have scored higher. Check the score for the results with fl=*,score in your query. To see how Solr arrived at the score, use fl=*,score&wt=xml&debugQuery=on (then right-click in your browser and view the page source to see a properly indented score calculation). I believe you will see the lengthNorm contributing to a lower score for doc1.
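
For example (again assuming a core named collection1 at http://localhost:8983/solr):

http://localhost:8983/solr/collection1/select?q=locales:en_US&fl=*,score&wt=xml&debugQuery=on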

To prevent the length of the field from contributing to the score, you need to disable it: set omitNorms=true for that field (ref: http://wiki.apache.org/solr/SchemaXml). Then see what the scores are.
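
A sketch of what the field definition could look like in schema.xml (assuming you keep the field stored and multivalued, as in the original definition):

<field name="locales" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true"/>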

Licensed under: CC-BY-SA with attribution