Question

I've been struggling with Solr and how to handle compound words for our German site. We mainly deal with clothes and accessories, so our search terms are usually words for wearable items. I've managed to fine-tune the DictionaryCompoundWordTokenFilterFactory so that it splits most of the compound search terms we are likely to encounter (for example: schwarzkleid => schwarz kleid).

However, the search is returning irrelevant results: it returns items that contain only the word "schwarz" as well as items that contain only the word "kleid". So instead of seeing only black dresses (schwarzkleid = black dress), I am seeing dresses of every color and also black items that are not dresses.

Essentially Solr is performing an OR on the split tokens and returning any item that contains either keyword.

My complete query is q=keywords:schwarzkleid AND deleted:0 (where 0 indicates that the product has not sold out yet). The debug output for this query is:

"debug": {
"rawquerystring": "keywords:schwarzkleid AND deleted:0",
"querystring": "keywords:schwarzkleid AND deleted:0",
"parsedquery": "+((keywords:schwarzkleid keywords:schwarz keywords:kleid)/no_coord) +deleted:0",
"parsedquery_toString": "+(keywords:schwarzkleid keywords:schwarz keywords:kleid) +deleted:`\b\u0000\u0000\u0000\u0000",

This returns 24,000+ results, whereas searching directly for keywords:schwarz AND keywords:kleid returns ~10,000 results, which is what I want. I am using Solr 4.7 and the Solr PHP library to talk to it from my web application.

Any ideas on how to fine-tune my query to get only the relevant results?

Here is the fieldType in question:

<!-- German -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index"> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="true" dataDir="/home/ali/Downloads/solr-4.7.0/example/solr/findemode-dev/conf/wordlist/"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
  <analyzer type="query"> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="false" dataDir="/home/ali/Downloads/solr-4.7.0/example/solr/findemode-dev/conf/wordlist/"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>

Solution

I've managed to solve this (in a rather hacky way) by using filter queries and the edismax query parser.

I added in my solrconfig.xml the following parameters:

<str name="defType">edismax</str>
<str name="mm">75%</str>
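
For reference, here is a minimal sketch of where these two parameters would typically sit in solrconfig.xml. It assumes they are set as defaults on the stock /select search handler; the handler name is an assumption, since the post does not say which handler was edited.

```xml
<!-- Assumption: the parameters are added as defaults on the stock /select handler -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Parse the main q parameter with the Extended DisMax parser -->
    <str name="defType">edismax</str>
    <!-- Minimum-should-match: fraction of optional clauses that must match -->
    <str name="mm">75%</str>
  </lst>
</requestHandler>
```

As far as I know, defType only affects how q is parsed, which would explain why the filter query further down carries an explicit {!edismax} prefix.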

Then, when searching for multiple keywords (for example: schwarzkleid wenz, where wenz is a German brand name), I use the first keyword as the main query and add every keyword after it as a filter query. So my final query looks something like this:

fl=id&sort=popular+desc&indent=on&q=keywords:'schwarzkleide'+&wt=json&fq={!edismax}+keywords:'wenz'&fq=deleted:0

My compound splitter filter splits schwarzkleide correctly, and the query is parsed by edismax with mm=75%. The filter queries are then applied, and the keyword filter queries are also parsed by edismax. The result is all the black dresses from 'Wenz'.

If anybody has a better solution than the one I've posted, I would be more than happy to read about it. I'm quite new to Solr, and to be honest I think my approach is a bit convoluted.

Thanks.

Licensed under: CC-BY-SA with attribution