Question

I have Solr with an indexed database. All the data in my database is in Latvian. The problem is that I need a search for the word Riga to match the word Rīga. Of course, I can define a synonym (Rīga = Riga), but can I simply define that the letter ī is the letter i? I read something about solr.ISOLatin1AccentFilterFactory, but as far as I understood, it is not for UTF-8 encoding, right? Any advice?

Solution

I used PatternReplaceFilterFactory in both the index and query analyzers. It seems to be working correctly.
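To illustrate why the replacement has to run on both sides, here is a minimal Python sketch of the idea. It is not Solr configuration, and the replacement table and the `analyze` helper are hypothetical names made up for this example. In Solr itself the same effect would presumably come from a PatternReplaceFilterFactory entry in each analyzer chain of the field type.

```python
import re

# Hypothetical replacement table; extend it for other Latvian letters as needed.
REPLACEMENTS = [
    (re.compile("ī"), "i"),
    (re.compile("Ī"), "I"),
]

def analyze(text):
    """Apply every pattern replacement, then lowercase the result."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text.lower()

# The same analysis runs when the document is indexed and when the query is parsed.
indexed_term = analyze("Rīga")
query_term = analyze("Riga")
print(indexed_term == query_term)  # prints True
```

If the replacement ran only at query time, the index would still hold the term with ī, and a query typed with ī would be folded to a plain i that no longer matches it.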

OTHER TIPS

ISOLatin1AccentFilterFactory is exactly what you are looking for... as long as the accented character EXISTS in the Latin-1 character set (the first 256 Unicode code points are identical to Latin-1, so UTF-8 input is not the problem). The ī that you mentioned (U+012B) does not exist in ISO-8859-1, so ISOLatin1AccentFilterFactory won't work in this SPECIFIC case. I would still recommend using ISOLatin1AccentFilterFactory in addition to any exceptions that you handle with PatternReplaceFilterFactory, as there are probably some Latvian characters that it will help with (I'm assuming here; I don't have experience with Latvian).

FYI, I did actually try this against my Solr setup with ISOLatin1AccentFilterFactory, and it didn't help in this case.
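A quick way to check which letters fall inside Latin-1 is to try encoding them. This is a small Python sketch, not anything Solr does internally; `in_latin1` is a hypothetical helper name, and the sample letters are just a few picked for illustration.

```python
def in_latin1(ch):
    """Return True if the character has a code point in ISO-8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# Two Western European letters followed by three Latvian ones.
for ch in "éüīāš":
    print(ch, hex(ord(ch)), in_latin1(ch))
```

Letters that report False here are the ones that would need a separate rule.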

Look at ICUTokenizerFactory and the related ICU analysis components, which provide Unicode-aware tokenization and character normalization. Extremely useful and very easy to set up.

http://lucene.apache.org/solr/api/org/apache/solr/analysis/ICUTokenizerFactory.html

http://site.icu-project.org/
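As a rough picture of what Unicode normalization buys you here, the sketch below uses Python's standard `unicodedata` module rather than ICU itself, so treat it as an approximation of the idea and not as the behaviour of any Solr component. `fold_accents` is a hypothetical helper name.

```python
import unicodedata

def fold_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_accents("Rīga"))     # prints Riga
print(fold_accents("Jūrmala"))  # prints Jurmala
```

Because ī decomposes into a plain i plus a combining macron, no per-letter table is needed, which is the main advantage over listing every replacement by hand.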

Licensed under: CC-BY-SA with attribution