Question

I'm using Lucene with Solr to index some documents (news). Those documents also have a HEADLINE. Now I am trying to run a facet search over the HEADLINE field to find the terms with the highest count. All of this works without a problem, including a stopword list. The HEADLINE field is a multi-valued field. I use solr.StandardTokenizerFactory to split the field into single terms (I know this is not best practice, but it's the only way and it works).

Sometimes the tokenizer splits terms that shouldn't be split, like 9/11 (which is split into 9 and 11). So I decided to use a "protword" list, and "9/11" is part of this list. But nothing changed.

Here is the relevant part of my schema.xml:

    <fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory" protected="protwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.TrimFilterFactory"/>
            <filter class="solr.StopFilterFactory"
                    ignoreCase="true"
                    words="stopwords.txt"
                    enablePositionIncrements="true"
                    protected="protwords.txt"
                    />
        </analyzer>
    </fieldType>

Looking at the facet result, I see lots of documents dealing with "9/11" grouped (faceted) under "9" or "11", but never under "9/11".

Why does this not work?

Thank you.


Solution 2

The final solution for this problem was to switch to solr.PatternTokenizerFactory.
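
A minimal sketch of what such a field type might look like. The pattern that was actually used is not stated here, so the regular expression below (split on whitespace and a few punctuation characters, but not on "/") is only an assumption for illustration:

    <fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
        <analyzer>
            <!-- Assumed pattern: the regex marks the separators between tokens.
                 "/" is deliberately not in the character class, so 9/11 stays one token. -->
            <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;:!?()]+"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        </analyzer>
    </fieldType>

The character class would need adjusting to the punctuation that really occurs in the headlines, and the documents have to be reindexed after the schema change before the facet counts reflect it.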

OTHER TIPS

The problem is that you cannot set protected words on just any filter or tokenizer. Only certain filters support that feature. The StandardTokenizer therefore ignores your protected attribute and splits 9/11 into '9' and '11' anyway. Using a WhitespaceTokenizer would ensure that 9/11 does not get split.
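
As a rough sketch, the analyzer from the question with the tokenizer swapped out could look like this (the filter chain is carried over from the question, minus the unsupported protected attributes):

    <analyzer>
        <!-- Splits on whitespace only, so "9/11" reaches the filters as a single token -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>

One trade-off to keep in mind: since only whitespace separates tokens, punctuation stays attached to the words (for example "attacks," with the trailing comma), which can scatter facet counts unless a later filter cleans that up.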

In addition, it does not look like the StopFilterFactory acknowledges protected words either (it just filters out stop words like 'to' or 'and'). The WordDelimiterFilterFactory does use protected words, so you might experiment with that to see if it can help you.
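
A sketch of that combination, assuming protwords.txt contains 9/11 on a line of its own. The attribute values are illustrative choices, not a tested recommendation:

    <analyzer>
        <!-- Whitespace tokenizer hands "9/11" to the filter chain unsplit -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Splits tokens on non-alphanumeric characters, except tokens listed in protwords.txt.
             Placed before lowercasing so that list entries are compared against the text as written. -->
        <filter class="solr.WordDelimiterFilterFactory"
                protected="protwords.txt"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>

With this chain, unprotected tokens are still split at punctuation and stripped of leading or trailing delimiters, while 9/11 should pass through unchanged. Newer Solr releases replace this filter with WordDelimiterGraphFilterFactory, so check which one your version ships.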

The best way to see how your documents are analyzed is to use the analysis page of the built-in Solr administration interface, which shows how a field is broken down when it is indexed or queried.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow