How to facet on phrases in Solr?

https://stackoverflow.com/questions/23150062

solr
facet

05-07-2023

Question

Lately I have been trying to facet on a field in which some values consist of multiple words (a phrase). It was suggested that I use shingles, but I am not sure that would work as expected, because the required phrases should be taken from a given list.

For example: when I facet on the field, I get separate facets for 'Information' and 'Technology', whereas I want a single facet like 'Information Technology'.

How can I facet on a particular phrase in a particular field?

EDIT: The schema for the required field looks like this:

<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
         possible with WordDelimiterFilter in conjunction with stemming. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The shingle filter doesn't give the result I want: with outputUnigrams="true" it produces three facets for 'Information Technology', namely 'information', 'technology' and 'information technology'.

Solution

The problem seems to be that the analyzers split the field's values into separate words at index time, so each word becomes its own facet value. To facet on a field whose values may contain multiple words, use a field type that does not tokenize the value, such as "string". You can fill that field with a "copy field" in Solr, so that your indexing process doesn't really change. For example, you could have something like the field below.

<field name="facet_text_en_nosplit" type="string" indexed="true" stored="false" multiValued="true"/>

Use the above field in your facet query.
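As a rough sketch of how the pieces could fit together in the schema, the fragment below pairs the untokenized field with a copyField rule. The source field name "category" is a hypothetical placeholder for your existing analyzed field, and the sketch assumes your schema already defines the standard "string" field type (solr.StrField).

```xml
<!-- Untokenized field used only for faceting (as defined in the answer) -->
<field name="facet_text_en_nosplit" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- "category" is a hypothetical name: replace it with your analyzed source field.
     The raw, pre-analysis value is copied, so 'Information Technology' stays one term. -->
<copyField source="category" dest="facet_text_en_nosplit"/>

<!-- Example request parameters: facet=true&facet.field=facet_text_en_nosplit -->
```

Existing documents need to be reindexed before the new field has any values. Because a string field is stored verbatim, facet values are case-sensitive, so 'Information Technology' and 'information technology' would be counted as two different facets.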

Licensed under: CC-BY-SA with attribution
