Ideographic space in solr query

https://stackoverflow.com/questions/18490749

26-06-2022

Question

I have an issue with solr that I can't seem to get past.

When I search for "マルチェロ ブラック" (with a normal space between the words), I get the expected results (15 of them). But when I search for "マルチェロ　ブラック" (which has an ideographic space, \u3000, between the words instead of a normal one), I get no results at all.

My fieldType configuration is pretty basic:

<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

I've tried adding

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-japanese.txt"/>

with mapping like

"\u3000" => "\u0020"

or even

"\u3000" => " "

but that didn't help.

I also tried adding

<filter class="solr.PositionFilterFactory" />

as suggested in Language Analysis: Chinese, Japanese, Korean, but then I started getting 200+ results for the first search and 1000+ results for the second. No good either.

I'm running solr version 3.5, so using CJKBigramFilterFactory is out of the question. (Just saying; I have no idea whether it would help anyway.)

I've read quite a lot of Japanese blogs on solr configuration (thanks, Google Chrome, for making this so easy!), but all the examples use just that CJKBigramFilterFactory, sometimes with an extra LowerCaseFilterFactory, and nothing that seems to help in my case.

Any ideas what else I could try to make this work?

Solution

Well, it actually turned out to be an issue with how the Drupal module Search API parses the query string before even passing it to solr, so the solr analyzer configuration was not at fault. Fixed with a small patch to the module; see the issue "Split query on whitespace, not only on space".
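
The difference between splitting on a literal space and splitting on any whitespace can be illustrated with a short Python sketch. This is not the Search API module's code (the module is written in PHP, and the actual patch is not reproduced here); the function names split_on_space_only and split_on_whitespace are hypothetical and only show why the second query would reach solr as a single term.

```python
from typing import List


def split_on_space_only(query: str) -> List[str]:
    """Split only on U+0020, as a naive query parser might (assumed behaviour)."""
    return [term for term in query.split(" ") if term]


def split_on_whitespace(query: str) -> List[str]:
    """Split on any Unicode whitespace, including the ideographic space U+3000."""
    # str.split() with no argument treats every Unicode whitespace
    # character as a separator and drops empty strings.
    return query.split()


ascii_query = "マルチェロ ブラック"             # normal space, U+0020
ideographic_query = "マルチェロ\u3000ブラック"  # ideographic space, U+3000

# The naive splitter sees two terms in the first query but only one in the second.
space_only_terms = split_on_space_only(ideographic_query)

# The whitespace-aware splitter yields the same two terms for both queries.
whitespace_terms = split_on_whitespace(ideographic_query)
```

With the space-only splitter, the ideographic-space query stays one long term containing U+3000, which is consistent with the empty result set described in the question; splitting on any whitespace yields the same two terms for both queries.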

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow