Question

I am trying to process a file that contains accented Unicode characters and replace them with their plain ASCII equivalents. I am running into a problem and get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data

My file looks like below:

ecteDV
ecteBl
agnéto

The code I use to replace the accents is shown below:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re, sys, unicodedata, codecs

f = codecs.open(sys.argv[1], 'r', 'utf-8')
for line in f:
    name = line.lower().strip()
    normal = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore')
    print normal
f.close()

Is there a way I can replace all the accents and normalize the contents of the file?

Solution

Consider that your file is perhaps not using UTF-8 as the encoding.

You are reading the file with the UTF-8 codec but decoding fails. Check that your file encoding is really UTF-8.
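If the bytes turn out not to be UTF-8, one common tactic is to try UTF-8 first and fall back to Latin-1, which can decode any byte sequence. A minimal sketch in Python 3 (the Latin-1 fallback and the helper name are assumptions for illustration; verify your file's actual encoding):

```python
def decode_with_fallback(raw_bytes):
    """Decode bytes as UTF-8, falling back to Latin-1 if that fails.

    The Latin-1 fallback is an assumption -- check what encoding your
    file really uses before relying on this.
    """
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Every byte value is valid Latin-1, so this never raises.
        return raw_bytes.decode('latin-1')
```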

Note that UTF-8 is one encoding out of many; it doesn't mean 'decode magically to Unicode'.

If you don't yet understand what encodings are (as opposed to what Unicode is, a related but separate concept), you need to do some reading on character encodings.
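To make the distinction concrete (a Python 3 illustration): the single byte 0xE9 is 'é' in Latin-1, but on its own it is not valid UTF-8, because 0xE9 begins a three-byte sequence in that encoding. The same bytes mean different things under different encodings:

```python
raw = 'é'.encode('latin-1')     # b'\xe9' -- one byte in Latin-1
print(raw.decode('latin-1'))    # prints é

try:
    # A lone 0xE9 starts a three-byte UTF-8 sequence, so decoding fails.
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('utf-8 decode failed:', e.reason)
```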

OTHER TIPS

Try opening the file with the following code, replacing "filename" with the name of your file.

import codecs

with codecs.open("filename", 'r', 'utf-8') as f:
    for line in f:
        # process each line here
        pass

Note that the with statement closes the file automatically, so no explicit f.close() is needed.
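Putting the pieces together, here is a sketch in Python 3 of the whole task: read the raw bytes, decode with a fallback, then strip accents via NFKD normalization. The Latin-1 fallback and the function names are assumptions for illustration; adjust the fallback to your file's real encoding.

```python
import sys
import unicodedata


def strip_accents(text):
    # Decompose accented characters (NFKD splits 'é' into 'e' plus a
    # combining accent), then drop the combining marks by encoding to
    # ASCII with errors='ignore'.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


def normalize_file(path, fallback='latin-1'):
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: the file is Latin-1; change fallback if needed.
        text = raw.decode(fallback)
    for line in text.splitlines():
        print(strip_accents(line.lower().strip()))


if __name__ == '__main__' and len(sys.argv) > 1:
    normalize_file(sys.argv[1])
```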
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow