Question

I am trying to process a file that contains accented Unicode characters and replace them with their plain ASCII equivalents. I am running into a problem and get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data

My file looks like below:

ecteDV
ecteBl
agnéto

The code I use to replace the accents is shown below:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re, sys, unicodedata, codecs

f = codecs.open(sys.argv[1], 'r', 'utf-8')
for line in f:
    name = line.lower().strip()
    normal = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore')
    print normal
f.close()

Is there a way I can replace all the accents and normalize the contents of the file?

Solution

Consider that your file is perhaps not using UTF-8 as the encoding.

You are reading the file with the UTF-8 codec but decoding fails. Check that your file encoding is really UTF-8.
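If the bytes turn out not to be UTF-8, one common tactic is to try UTF-8 first and fall back to Latin-1, which can decode any byte sequence. A minimal sketch in Python 3 (the Latin-1 fallback and the helper name are assumptions for illustration; verify your file's actual encoding):

```python
def decode_with_fallback(raw_bytes):
    """Decode bytes as UTF-8, falling back to Latin-1 if that fails.

    The Latin-1 fallback is an assumption -- check what encoding your
    file really uses before relying on this.
    """
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Every byte value is valid Latin-1, so this never raises.
        return raw_bytes.decode('latin-1')
```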

Note that UTF-8 is one encoding out of many; it doesn't mean 'decode magically to Unicode'.

If you don't yet understand what encodings are (as opposed to what Unicode is, a related but separate concept), you need to do some reading on character encodings.
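To make the distinction concrete (a Python 3 illustration): the single byte 0xE9 is 'é' in Latin-1, but on its own it is not valid UTF-8, because 0xE9 begins a three-byte sequence in that encoding. The same bytes mean different things under different encodings:

```python
raw = 'é'.encode('latin-1')     # b'\xe9' -- one byte in Latin-1
print(raw.decode('latin-1'))    # prints é

try:
    # A lone 0xE9 starts a three-byte UTF-8 sequence, so decoding fails.
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('utf-8 decode failed:', e.reason)
```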

OTHER TIPS

Try opening the file with the following code, replacing "filename" with the name of your file.

import codecs

with codecs.open("filename", 'r', 'utf-8') as f:
    for line in f:
        # process each line here
        pass

Note that the with statement closes the file automatically, so no explicit f.close() is needed.
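Putting the pieces together, here is a sketch in Python 3 of the whole task: read the raw bytes, decode with a fallback, then strip accents via NFKD normalization. The Latin-1 fallback and the function names are assumptions for illustration; adjust the fallback to your file's real encoding.

```python
import sys
import unicodedata


def strip_accents(text):
    # Decompose accented characters (NFKD splits 'é' into 'e' plus a
    # combining accent), then drop the combining marks by encoding to
    # ASCII with errors='ignore'.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


def normalize_file(path, fallback='latin-1'):
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: the file is Latin-1; change fallback if needed.
        text = raw.decode(fallback)
    for line in text.splitlines():
        print(strip_accents(line.lower().strip()))


if __name__ == '__main__' and len(sys.argv) > 1:
    normalize_file(sys.argv[1])
```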
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow