Question

I know there are many questions out there already concerning encoding/decoding. But this is driving me nuts and I'm in desperate need of some help.

I read in a file, converting each line to unicode:

line = unicode(line,'latin-1')

Then I make some changes and try to write the contents back to a file, encoding the string like this:

o_str = '%s,%s' % (new_sname, loc )
w_out.write(o_str.encode('latin-1'))

The file contains, for instance, the city name 'Genève', which is u'Gen\xc3\xa8ve' as unicode. Encoding it as Latin-1

gue = gu.encode('iso-8859-1')

gives me on the console

>>> print gue
Genève

But in my file it is still 'GenÃ¨ve'. Can somebody point out what I am missing?


Solution

You are decoding UTF-8 data as Latin-1; use the correct codec instead:

>>> 'Gen\xc3\xa8ve'.decode('latin1')
u'Gen\xc3\xa8ve'
>>> print 'Gen\xc3\xa8ve'.decode('latin1')
GenÃ¨ve
>>> 'Gen\xc3\xa8ve'.decode('utf8')
u'Gen\xe8ve'
>>> print 'Gen\xc3\xa8ve'.decode('utf8')
Genève

The correct Unicode code point for the letter è is U+00E8, written \u00e8 or \xe8 in a Python Unicode literal and encoded as the two bytes C3 A8 in UTF-8. Misinterpreting C3 A8 as Latin-1 yields two Unicode characters, Ã and ¨, which you then write back to your file as the bytes C3 and A8 again, because Latin-1 maps one-to-one onto the first 256 Unicode code points.
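
To make the round trip concrete, here is a minimal sketch that should run on both Python 2.7 and Python 3. It assumes the input file really is UTF-8, which is what the bytes C3 A8 suggest. The helper name utf8_to_latin1 and its path arguments are hypothetical, not something from the question.

```python
import io

raw = b'Gen\xc3\xa8ve'          # the UTF-8 bytes stored in the input file

wrong = raw.decode('latin-1')   # 7 code points: C3 and A8 become two characters
right = raw.decode('utf-8')     # 6 code points: C3 A8 becomes U+00E8

# Latin-1 maps bytes 0-255 straight to U+0000-U+00FF, so the wrong decode
# re-encodes to exactly the bytes it started from.
assert wrong.encode('latin-1') == raw
# The correct decode produces a single E8 byte when written as Latin-1.
assert right.encode('latin-1') == b'Gen\xe8ve'


def utf8_to_latin1(src_path, dst_path):
    """Hypothetical helper: copy a UTF-8 text file to a Latin-1 text file."""
    with io.open(src_path, encoding='utf-8') as src, \
         io.open(dst_path, 'w', encoding='latin-1') as dst:
        for line in src:
            # line is already Unicode text here; change it as needed
            dst.write(line)
```

If the text contains characters that Latin-1 cannot represent, the write raises UnicodeEncodeError, so a real script may need to choose an error handler or a different output encoding.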

Licensed under: CC-BY-SA with attribution