Question

Setup:

  • Ubuntu 12.04 Server installed via VMWare quick install
  • PostgreSQL 9.1
  • ElasticSearch 0.90
  • Mono 3.2.1
  • Rails 4
  • Nginx 1.4.2 + Passenger 4.0.16

I have a C# program that, on startup, builds a new ElasticSearch index and points the alias used by the Rails applications to it. The program then keeps running and watches a Redis instance for things to update.
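
For reference, the index-then-swap-alias step is just the standard ElasticSearch 0.90 _aliases call; the index and alias names below are placeholders:

# Atomically repoint the alias the Rails side queries from the old index
# to the freshly built one (names are hypothetical).
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "index": "items_v1", "alias": "items" } },
    { "add":    { "index": "items_v2", "alias": "items" } }
  ]
}'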

There is another C# program that scrapes data from web pages; the scraped data is stored in PostgreSQL and the index writer above is notified via Redis. The source pages come in varying encodings and are converted to UTF-8.

This bug first appeared when I made a mistake and encoded data that was already UTF-8 as UTF-8 a second time.
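
To illustrate what that double encoding does: converting bytes that are already UTF-8 a second time (here pretending they are ISO-8859-1, purely as an example) produces the classic mojibake instead of the original text:

# "ü" is 0xC3 0xBC in UTF-8; re-interpreting those bytes as ISO-8859-1
# and converting to UTF-8 again yields "Ã¼".
printf 'ü' | iconv -f ISO-8859-1 -t UTF-8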

Investigation

At first I assumed I obviously had some data corruption going on, but the weird thing is: the umlauts are only corrupted when I start the indexing Mono process from Rails via nohup. If I kill that process and start it manually from the command line, it works perfectly fine.

When I do a backup/restore of the database, it works again from the web interface, but once the server is rebooted the umlauts are once more replaced with ?? when the Mono process is started from the web interface.

The first thing I did was purge the affected rows from the database and scrape the data again (without encoding it twice); that didn't help. Since the error only appears when the process is run non-interactively via nohup from the Rails application, I assumed it was caused by the locale settings, so I changed them in both /etc/default/locale and /etc/environment to en_US.UTF-8 and en_US:en, but that did not help either.
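
One way to see which locale the nohup-started Mono process actually inherits is to read its environment from /proc (the pgrep pattern below is only an example and may need adjusting):

# Print the locale-related variables of the running indexer process.
cat /proc/$(pgrep -f mono | head -n1)/environ | tr '\0' '\n' | grep -E 'LANG|LC_'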

I really have no idea what else I can do or what exactly causes this error; any help would be appreciated.

Edit: I forgot to clarify the most important part: when umlauts are replaced with ??, ALL umlauts are replaced in every single document in the index.

Solution

Put this in the script that you use to start your process:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

The reason your script only picks up UTF-8 when you start things manually is that your interactive shell sets these variables, while they are not applied system-wide to processes started non-interactively. I've run into this with JRuby and init.d scripts before; the solution is to not rely on the defaults for this.
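
As a minimal sketch, the start script could look like this, assuming the indexer is launched with mono via nohup (the binary and log paths are placeholders):

#!/bin/sh
# Force a UTF-8 locale for this process and its children before launching
# the indexer; LC_ALL overrides any individual LC_* categories.
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

nohup mono /opt/indexer/Indexer.exe >> /var/log/indexer.log 2>&1 &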

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow