Python: how to write to a file with encoding 'latin-1'

https://stackoverflow.com/questions/18251339

24-06-2022

Question

I know there are many questions out there already concerning encoding/decoding. But this is driving me nuts and I'm in desperate need of some help.

I read in a file converting the lines to unicode

line = unicode(line,'latin-1')

Then, I do some mutations and try to write the contents back to a file, encoding the string like this

o_str = '%s,%s' % (new_sname, loc )
w_out.write(o_str.encode('latin-1'))

The file contains, for instance, the city name 'Genève', which is u'Gen\xc3\xa8ve' as unicode. Encoding it as Latin-1

gue = gu.encode('iso-8859-1')

gives me on the console

>>> print gue
Genève

But in my file it is still 'GenÃ¨ve'. Can somebody point out what I am missing?

Solution

You are decoding UTF-8 data as Latin-1; use the correct codec instead:

>>> 'Gen\xc3\xa8ve'.decode('latin1')
u'Gen\xc3\xa8ve'
>>> print 'Gen\xc3\xa8ve'.decode('latin1')
GenÃ¨ve
>>> 'Gen\xc3\xa8ve'.decode('utf8')
u'Gen\xe8ve'
>>> print 'Gen\xc3\xa8ve'.decode('utf8')
Genève

The correct Unicode code point for the letter è is U+00E8, written \u00e8 or \xe8 in a Python Unicode literal and encoded as the two bytes C3 A8 in UTF-8. Misinterpreting C3 A8 as Latin-1 yields two Unicode characters, Ã and ¨, which you then write back to your file as the bytes C3 and A8 again, because Latin-1 maps one-to-one onto the first 256 Unicode code points.
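The round trip described above can be checked at the byte level, and the same idea extends to whole files. The following is only a sketch, written in Python 3 syntax (io.open is also available in Python 2.6 and later). It assumes the input file really is UTF-8, as diagnosed above; the helper name utf8_to_latin1 and its parameters src_path and dst_path are hypothetical and do not come from the question.

```python
import io

# The UTF-8 bytes for 'Genève', as they are stored in the input file.
raw = b'Gen\xc3\xa8ve'

# Wrong codec: each byte becomes its own character, and encoding back
# to Latin-1 simply reproduces the original UTF-8 bytes.
wrong = raw.decode('latin-1')
assert wrong == u'Gen\xc3\xa8ve'
assert wrong.encode('latin-1') == raw

# Right codec: the bytes C3 A8 become the single code point U+00E8,
# which Latin-1 stores as the single byte E8.
right = raw.decode('utf-8')
assert right == u'Gen\xe8ve'
assert right.encode('latin-1') == b'Gen\xe8ve'


def utf8_to_latin1(src_path, dst_path):
    """Copy a UTF-8 text file to dst_path, re-encoded as Latin-1.

    Hypothetical helper: assumes src_path holds valid UTF-8.
    newline='' keeps the original line endings untouched.
    """
    with io.open(src_path, 'r', encoding='utf-8', newline='') as src, \
         io.open(dst_path, 'w', encoding='latin-1', newline='') as dst:
        for line in src:
            # Any per-line changes would go here, on the decoded text.
            dst.write(line)
```

Letting io.open handle the decoding and encoding keeps every string in the program as Unicode text, so no manual .encode() call is needed before write(). If the data contains a character that Latin-1 cannot represent, the write raises UnicodeEncodeError rather than silently producing wrong bytes.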

Licensed under: CC-BY-SA with attribution
