My bad. Yes, StandardAnalyzer is not the problem here: I was reading the data with the wrong character encoding (UTF-8), while it had been written in Windows-1257. This produced unnecessary symbols, which were interpreted as rubbish. Switching to the right encoding solved the issue :)
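For anyone hitting the same thing, here is a minimal sketch of the mismatch using only the JDK, no Lucene involved. The class name EncodingDemo, the helpers decode and readFile, and the temp file are made up for illustration; the only assumption taken from the post is that the input bytes are Windows-1257. The Lithuanian letters are written as Unicode escapes so the example does not depend on the encoding of the source file itself.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class EncodingDemo {
    // Assumption from the post: the data was written as Windows-1257 (Baltic).
    static final Charset WINDOWS_1257 = Charset.forName("windows-1257");

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Hypothetical helper: read a whole file and decode it with the given charset.
    static String readFile(Path path, Charset charset) throws IOException {
        return decode(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        // The letters from the question (uppercase, then lowercase) as escapes.
        String original = "\u0104\u010C\u0118\u0116\u012E\u0160\u0172\u016A\u017D"
                        + "\u0105\u010D\u0119\u0117\u012F\u0161\u0173\u016B\u017E";

        Path file = Files.createTempFile("lt-sample", ".txt");
        try {
            // Simulate the input: text stored on disk as Windows-1257 bytes.
            Files.write(file, original.getBytes(WINDOWS_1257));

            // Reading with the wrong charset mangles every Lithuanian letter.
            String wrong = readFile(file, StandardCharsets.UTF_8);
            // Reading with the charset the file was written in restores them.
            String right = readFile(file, WINDOWS_1257);

            System.out.println("UTF-8 matches original: " + wrong.equals(original));
            System.out.println("windows-1257 matches original: " + right.equals(original));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Once the text is decoded with the right charset, the resulting String can be handed to the analyzer as usual. In the UTF-8 case the bytes are not valid UTF-8 sequences, so Java substitutes replacement characters, which is probably the "rubbish" the tokenizer then dropped.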
Lucene: how to keep Lithuanian language symbols in StandardAnalyzer?
21-09-2022 - |
Question
I have built my own analyzer with Lucene (version 4.3.0) to remove unnecessary data and stop words:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, new CharArraySet(Version.LUCENE_43, stopWords, true));
Everything works as expected, but my language is Lithuanian, so I would like to keep the Lithuanian letters: 'ĄČĘĖĮŠŲŪŽąčęėįšųūž'. The main problem is that Lithuanian does not have its own analyzer. At the moment, words come out truncated (without the ĄČĘĖĮŠŲŪŽąčęėįšųūž symbols). Any suggestions on how to override the format method or otherwise keep these symbols? I don't need a stemming tool.
Solution
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow