SOLR Dropping Emoji Miscellaneous characters

https://stackoverflow.com/questions/19773786

03-07-2022

Question

It looks like SOLR is considering what should be valid Unicode characters as invalid, and dropping them.

I "proved" this by turning on query debug to see what the parser was doing with my query. Here's an example:

Query = 'ァ☀' (\u30a1\u2600)

Here's what SOLR did with it:

'debug':{
  'rawquerystring':u'\u30a1\u2600',
  'querystring':u'\u30a1\u2600',
  'parsedquery':u'(+DisjunctionMaxQuery((text:\u30a1)))/no_coord',
  'parsedquery_toString':u'+(text:\u30a1)',

As you can see, it was OK with 'ァ', but it ate the "Black Sun" character.

I haven't tried every character in the block, but I've confirmed it also drops ⛿ (\u26ff) and ♖ (\u2656).

I'm using SOLR with Jetty, so the various Tomcat character-encoding issues shouldn't apply.

Solution

This most likely comes down to the Analyzer rather than to character encoding. I don't see anything that spells out exactly how these characters are handled, but the StandardAnalyzer (or whichever Analyzer your field uses) is probably treating them much like punctuation, so they never make it into the final query. StandardAnalyzer implements the rules set forth in UAX #29, Unicode Text Segmentation, to separate input into tokens.
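To see why a tokenizer might treat these characters like punctuation, it can help to look at their Unicode general categories. The sketch below uses only Python's standard `unicodedata` module. The helper names (`describe`, `looks_like_word_char`, `surviving_chars`) are made up for illustration, and the "letters, marks and numbers survive" rule is a rough assumption, not the actual UAX #29 algorithm or Lucene's implementation.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (
        "U+{:04X}".format(ord(ch)),
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
    )


def looks_like_word_char(ch):
    # Rough heuristic (an assumption, NOT Lucene's real rule set):
    # treat letters (L*), marks (M*) and numbers (N*) as word content.
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def surviving_chars(text):
    """Keep only the characters the heuristic above counts as word content."""
    return "".join(ch for ch in text if looks_like_word_char(ch))


if __name__ == "__main__":
    # The characters from the question: katakana small A, black sun,
    # white flag with horizontal middle black stripe, white chess rook.
    for ch in "\u30a1\u2600\u26ff\u2656":
        print(describe(ch), looks_like_word_char(ch))
```

The katakana letter is in category `Lo` (Letter, other), while the three symbols are in `So` (Symbol, other), which is consistent with the parsed query keeping only `\u30a1`. To confirm what your field's analysis chain really does, use the Analysis screen in the Solr admin UI on the field in question.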

Licensed under: CC-BY-SA with attribution
