Lucene: how to keep Lithuanian language symbols in StandardAnalyzer?

https://stackoverflow.com/questions/20766858

21-09-2022

Question

I have built my own analyzer for removing unnecessary data and stop words with Lucene (version 4.3.0).

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, new CharArraySet(Version.LUCENE_43, stopWords, true));

Everything works as expected, but my language is Lithuanian, so I would like to keep the Lithuanian letters: 'ĄČĘĖĮŠŲŪŽąčęėįšųūž'. The main problem is that the Lithuanian language doesn't have its own analyzer. At the moment, words come out truncated (without the ĄČĘĖĮŠŲŪŽąčęėįšųūž symbols). Any suggestions on how to override the format method or otherwise keep these symbols? I don't need a stemming tool.

Solution

My bad. Yes, StandardAnalyzer is not the problem here: I was reading the data with the wrong character encoding (UTF-8), while it had actually been written in Windows-1257. This produced spurious symbols, which were treated as rubbish. Reading the data with the correct encoding solved the issue :)
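
For anyone hitting the same thing, here is a minimal sketch of the mismatch, not the exact code from my project. The class name EncodingDemo and the helper readAll are made up for illustration, and it assumes the JRE ships the windows-1257 charset (full JDK builds normally do). It writes the Lithuanian letters as Windows-1257 bytes, then reads them back once as UTF-8 and once as Windows-1257.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Lucene.
class EncodingDemo {
    // Assumption: this charset is available in the running JRE.
    static final Charset WINDOWS_1257 = Charset.forName("windows-1257");

    // Reads the whole file and decodes its bytes with the given charset.
    static String readAll(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // The lowercase Lithuanian letters, written as escapes so the
        // example does not depend on the source file's own encoding.
        String original = "\u0105\u010D\u0119\u0117\u012F\u0161\u0173\u016B\u017E";

        // Simulate a data file that was saved as Windows-1257.
        Path file = Files.createTempFile("lt-sample", ".txt");
        Files.write(file, original.getBytes(WINDOWS_1257));

        // Decoding with the wrong charset: the bytes are not valid UTF-8,
        // so they are replaced with U+FFFD replacement characters.
        String wrong = readAll(file, StandardCharsets.UTF_8);

        // Decoding with the charset the file was written in.
        String right = readAll(file, WINDOWS_1257);

        System.out.println("UTF-8 matches original:        " + wrong.equals(original));
        System.out.println("windows-1257 matches original: " + right.equals(original));

        Files.delete(file);
    }
}
```

In the real indexing code the same idea applies: decode the input with the charset it was written in (for example via Files.newBufferedReader(path, charset)) before the text reaches the analyzer.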

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow