Question

We are moving our Hadoop cluster to a new setup using Cloudera CDH 5, and I am encountering an issue which we didn't have on our old (non-Cloudera) cluster.

Any text emitted by tasks on two (of the ten) machines in the cluster comes out with ?? in place of è, à, etc., e.g. "fran??ois-mout??".

We are using Hadoop Streaming with a .NET mapper and reducer running under Mono. I have set the encoding inside the classes to UTF-8, and $LOCALE on all machines (Ubuntu 12.04) is the same: en_GB.UTF-8.
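As a sanity check I have been confirming that LC_CTYPE (not just LANG) really is identical on every node with a loop like the one below; the hostnames are placeholders for our actual nodes:

for h in node01 node02 node10; do
    ssh "$h" 'printf "%s: " "$(hostname)"; locale | grep LC_CTYPE'
done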

I am close to running out of ideas for this error and really need a solution.

Thanks

Was it helpful?

Solution

I found a solution to this thanks to http://www.stewh.com/2013/12/working-with-chinese-or-other-utf8-encoded-text-in-hadoop-streaming/

I needed to add -cmdenv LC_CTYPE=en_GB.UTF-8 to the end of my Hadoop Streaming command.
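For context, the full invocation ends up looking roughly like the sketch below; the streaming jar path, HDFS paths and executable names are illustrative placeholders from my setup, not anything mandated by CDH 5:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper.exe,reducer.exe \
    -input /data/names/input \
    -output /data/names/output \
    -mapper "mono mapper.exe" \
    -reducer "mono reducer.exe" \
    -cmdenv LC_CTYPE=en_GB.UTF-8

The -cmdenv option passes the environment variable to the mapper and reducer processes themselves, which is what the login-shell locale alone was not doing on those two nodes.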

I am going to experiment with adding it to /etc/default/locale

~$ update-locale LC_CTYPE="en_GB.UTF-8"

for a permanent fix, but for now this is a good solution.
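If that pans out, /etc/default/locale on each node should end up containing a line like the one below (any existing LANG/LANGUAGE entries would be left in place):

~$ cat /etc/default/locale
LC_CTYPE="en_GB.UTF-8"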
