You are decoding UTF-8 data as Latin-1; use the correct codec instead:
>>> 'Gen\xc3\xa8ve'.decode('latin1')
u'Gen\xc3\xa8ve'
>>> print 'Gen\xc3\xa8ve'.decode('latin1')
GenÃ¨ve
>>> 'Gen\xc3\xa8ve'.decode('utf8')
u'Gen\xe8ve'
>>> print 'Gen\xc3\xa8ve'.decode('utf8')
Genève
The correct Unicode codepoint for the letter è is U+00E8, represented by \u00e8 or \xe8 in a Python Unicode literal, and by the hex bytes C3 A8 in UTF-8. Misinterpreting C3 A8 as Latin-1 leads to two Unicode characters, Ã and ¨, which you then write back to your file as C3 and A8 again because Latin-1 maps one-to-one onto the first 256 Unicode codepoints.
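The round trip can be sketched as follows. This is only an illustration, written in Python 3 syntax (where byte strings need a b prefix) rather than the Python 2 used in the session above; the variable names are made up for the example. Because Latin-1 is reversible, text that has already been misdecoded can usually be repaired by encoding it back to Latin-1 and decoding the result as UTF-8.

```python
# Python 3 sketch; all names here are illustrative, not from the original code.
raw = b'Gen\xc3\xa8ve'  # UTF-8 bytes for "Genève"

wrong = raw.decode('latin-1')  # each byte becomes its own character
right = raw.decode('utf-8')    # C3 A8 is decoded as the single character U+00E8

# Latin-1 maps bytes 0x00-0xFF straight onto U+0000-U+00FF, so re-encoding
# the misdecoded text reproduces the original bytes unchanged.
roundtrip = wrong.encode('latin-1')

# The same property allows already-misdecoded text to be repaired.
repaired = wrong.encode('latin-1').decode('utf-8')

print(repr(wrong))
print(repr(right))
print(roundtrip == raw)
print(repaired == right)
```

The repair step only works if the misdecoded text was never altered or re-encoded with a different codec in between.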