Encoding issue when reading file in Python

https://stackoverflow.com/questions/21749442

10-10-2022

Question

I have a file containing

    foo = "Gro\xdfbritannien"

I'm reading it with the following code, but it always prints the original text, including the literal \xdf escape:

    import codecs
    f = codecs.open('myfile', 'r', 'utf8')
    for line in f:
      print line
      print line.encode('utf-8')
      print line.decode('utf-8')

I can't see how to display the properly decoded text, the way it appears when I do

    >>> print u'Gro\xdfbritannien'
    Großbritannien

Any hint would be appreciated!

Solution

When your file contains the line

    foo = "Gro\xdfbritannien"

it contains an actual backslash character, followed by the characters x, d and f. So when that line is read into a Python string, the result is

    'foo = "Gro\\xdfbritannien"'

(and since those are all ASCII characters, it doesn't matter if you open it with the utf-8 codec or not).
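As a quick check of that claim, the sketch below builds the same text with a raw string literal and compares it with the real one-character escape. The raw literal is only a stand-in for the line read from the file (an assumption made for illustration), and the code is written for Python 3, where print is a function.

```python
# A raw string keeps the backslash as a real character, just like the
# text read from the file (this literal is a stand-in for that line).
line = r'foo = "Gro\xdfbritannien"'

# Locate the backslash and take the four characters that form the escape.
start = line.index("\\")
escape_in_file = line[start:start + 4]

print(list(escape_in_file))  # ['\\', 'x', 'd', 'f']
print(len(escape_in_file))   # 4: four separate ASCII characters
print(len("\xdf"))           # 1: a real escape is a single character

# Every character in the line is plain ASCII, so the codec used to open
# the file makes no difference.
is_ascii = all(ord(ch) < 128 for ch in line)
print(is_ascii)              # True
```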

So you first need to decode it with the string_escape codec, which exists in Python 2 only (here foo holds the escaped text 'Gro\\xdfbritannien'):

    >>> foo.decode("string_escape")
    'Gro\xdfbritannien'

and then decode the resulting byte string into a Unicode object:

    >>> _.decode("latin1")
    u'Gro\xdfbritannien'

which you can then print:

    >>> print _
    Großbritannien
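The steps above are Python 2 code. For Python 3, where string_escape is not available, a rough equivalent can be sketched with the unicode_escape codec. The helper name unescape_line is hypothetical, and the sketch assumes the input contains only characters in the Latin-1 range, which holds for the all-ASCII line discussed here.

```python
def unescape_line(line: str) -> str:
    """Interpret backslash escapes such as \\xdf in already-read text.

    Hypothetical helper for illustration. Assumes every character in
    `line` fits in Latin-1; other characters raise UnicodeEncodeError.
    """
    # unicode_escape works on bytes and treats unescaped bytes as Latin-1,
    # so encoding with Latin-1 first round-trips those characters.
    return line.encode("latin-1").decode("unicode_escape")


# Raw literal standing in for the line read from the file.
line = r'foo = "Gro\xdfbritannien"'
print(unescape_line(line))  # foo = "Großbritannien"
```

This does the escape handling and the Latin-1 interpretation in one step, rather than in the two separate decode calls used in Python 2.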

OTHER TIPS

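The codec is not the issue here. The text read from the file is equivalent to a literal with an escaped backslash, 'foo = "Gro\\xdfbritannien"', which prints the escape sequence unchanged:

    >>> print u'Gro\\xdfbritannien'
    Gro\xdfbritannien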

Licensed under: CC-BY-SA with attribution