I'm trying to implement a very basic word count example with MRJob. Everything works fine with ASCII input, but when I mix Cyrillic words into the input, I get output like this:
"\u043c\u0438\u0440" 1
"again!" 1
"hello" 2
"world" 1
As far as I understand, the first row above is the encoded form of the single occurrence of the Cyrillic word "мир", which is the correct count for my sample input text. Here is the MRJob code:
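from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, key, line):
        # Input lines arrive as byte strings; decode from cp1251 before splitting.
        line = line.decode('cp1251').strip()
        words = line.split()
        for term in words:
            yield term, 1

    def reducer(self, term, howmany):
        # Sum the per-word counts emitted by the mapper.
        yield term, sum(howmany)


if __name__ == '__main__':
    MRWordCount.run()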
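As a sanity check on my reading of the first output row, here is a minimal sketch that parses that key as a JSON string literal. It assumes the output keys are JSON-encoded with non-ASCII characters escaped, which is only my guess from the quoting; `raw_key` and `decode_key` are names made up for this illustration and are not part of mrjob.

```python
# -*- coding: utf-8 -*-
import json

# Key copied from the first output row, including the surrounding quotes.
raw_key = '"\\u043c\\u0438\\u0440"'


def decode_key(raw):
    """Parse a JSON string literal and return the text it encodes (assumed format)."""
    return json.loads(raw)


decoded = decode_key(raw_key)
# Compare instead of printing the word itself, since the console
# encoding may not be able to display Cyrillic characters.
print(decoded == u'мир')
print(len(decoded))
```

If the assumption holds, the escaped key and "мир" are the same three-character string, so the count itself is right and only the way it is displayed is at issue.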
I'm using Python 2.7 and mrjob 0.4.2 on Windows.
My questions are:
a) How can I produce readable Cyrillic output from Cyrillic input?
b) What is the root cause of this behavior? Is it due to the Python or mrjob version, or is it expected to work differently on non-Windows systems? Any clues?
For reference, here is the output of python -c "print u'мир'":
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp866.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>