Question

I'm trying to write to a file and I get the following error:

Traceback (most recent call last):
  File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/merge-395780681.888.py", line 151, in <module>
    gc_all_d.writerow(row)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0329' in position 5: ordinal not in range(128)

The error occurs after I try to write a row from a database of counselors to a file that is aggregating their names:

# compile master spreadsheet
with(open('gc_all.txt_3','w')) as gc_all:
    gc_all_d = csv.DictWriter(gc_all,  fieldnames = fieldnames, extrasaction='ignore', delimiter = '\t') 
    gc_all_d.writeheader()
    for row in aicep_l:
        print row['name']
        gc_all_d.writerow(row)
    for row in nbcc_l:
        gc_all_d.writerow(row)
        print row['name']

I'm in unfamiliar waters here. I don't see a parameter in the writerow() method that would widen the encoding range to include the character u'\u0329'.

I suspect the error has something to do with the nameparser module, which I'm using to normalize all of the counselors' names into the same format. The HumanName class from nameparser may return the names as unicode strings - with the leading u'' that Python uses to mark unicode - so the output is u'Sam the Man' rather than 'Sam the Man', and that may be what isn't recognized.
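For reference, here is a quick check of what u'\u0329' actually is and how it encodes. The name used is hypothetical; the leading u'' is only Python's repr notation for a unicode string, not characters in the data itself:

```python
# -*- coding: utf-8 -*-
# u'\u0329' is COMBINING VERTICAL LINE BELOW, a diacritic that rides on
# the previous letter. It can turn up in transliterated names.
name = u"Ngu\u0329yen"             # hypothetical name containing the mark
print(len(name))                   # 7 code points: the mark counts as one
print(repr(name.encode("utf-8")))  # UTF-8 encodes u'\u0329' as the two bytes 0xcc 0xa9
```

Note the 0xcc byte in the UTF-8 encoding - the same byte that shows up in the second traceback below.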

Thanks for the help!


Error after amending the code based on the answer:

  File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/merge-395782963.700.py", line 153, in <module>
    row['name'] = row['name'].encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 11: ordinal not in range(128)

Code that makes all of the name entries uniform:

# nbcc
with(open('/Users/samuelfinegold/Documents/noodle/gc/nbcc/nbcc_output.txt', 'rU')) as nbcc:
    nbcc_d = csv.DictReader(nbcc, delimiter = '\t')
    nbcc_l = []
    for row in nbcc_d:
#         name = HumanName(row['name'])
#         row['name'] = name.title + ' ' + name.first + ' ' + name.middle + ' ' + name.last + ' ' + name.suffix       
        row['phone'] = row['phone'].translate(None, whitespace + punctuation)
        nbcc_l.append(row)
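Since Python 2's csv reader yields byte strings, one option (a sketch, not the code above - decode_row is a hypothetical helper, and it assumes the input file is UTF-8) is to decode every field to unicode immediately after reading, so all downstream string handling works on unicode:

```python
# Hypothetical helper: decode every byte-string field in a csv row to
# unicode right after reading (assumes the input file is UTF-8).
def decode_row(row, encoding="utf-8"):
    return {key: (value.decode(encoding) if isinstance(value, bytes) else value)
            for key, value in row.items()}

row = {"name": b"Ngu\xcc\xa9yen", "phone": b"5551234"}  # as a py2 DictReader would yield it
decoded = decode_row(row)
print(repr(decoded["name"]))
```

With rows decoded on the way in, a single encode on the way out is then enough.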

Amended code:

# compile master spreadsheet
with(open('gc_all.txt_3','w')) as gc_all:
    gc_all_d = csv.DictWriter(gc_all,  fieldnames = fieldnames, extrasaction='ignore', delimiter = '\t') 
    gc_all_d.writeheader()
    for row in nbcc_l:
        row['name'] = row['name'].encode('utf-8')
        gc_all_d.writerow(row)

Error:

Traceback (most recent call last):
  File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/merge-395784700.086.py", line 153, in <module>
    row['name'] = row['name'].encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 11: ordinal not in range(128)

Solution

From the docs:

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.

You'll need to encode your data before writing it - something like:

for row in aicep_l:
    print row['name']
    for key, value in row.iteritems():
        row[key] = value.encode('utf-8')
    gc_all_d.writerow(row)

Or, since you're on 2.7, you can use a dictionary comprehension:

for row in aicep_l:
    print row['name']
    row = {key: value.encode('utf-8') for key, value in row.iteritems()}
    gc_all_d.writerow(row)

Or use some of the more sophisticated patterns on the examples page in the docs.
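The per-row loops above can also be wrapped in a small helper. encode_row is a hypothetical name, and the sketch assumes every value is either a unicode string or already bytes:

```python
# Hypothetical helper: UTF-8-encode every unicode value in a row dict so
# that Python 2's csv writer only ever sees byte strings.
def encode_row(row, encoding="utf-8"):
    return {key: (value.encode(encoding) if isinstance(value, type(u"")) else value)
            for key, value in row.items()}

row = {u"name": u"Sam\u0329 the Man", u"phone": u"5551234"}
print(repr(encode_row(row)[u"name"]))
```

You would then pass the result straight to gc_all_d.writerow(...).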

OTHER TIPS

What you have is an output stream (your gc_all.txt_3 file, opened on the with line, stream instance in variable gc_all) that Python believes must hold nothing but ASCII. You've asked it to write a Unicode string with the Unicode character '\u0329'. For instance:

>>> s = u"foo\u0329bar"
>>> with open('/tmp/unicode.txt', 'w') as stream: stream.write(s)
...

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0329' in position 3:
ordinal not in range(128)

You have a bunch of options, including doing an explicit .encode on each string. Or, you can open the file with codecs.open as described in http://docs.python.org/2/howto/unicode.html (I'm assuming Python 2.x, 3.x is a little different):

>>> import codecs
>>> with codecs.open('/tmp/unicode.txt', 'w', encoding='utf-8') as stream:
...     stream.write(s)
... 
>>> 

Edit to add: as @Peter DeGlopper's answer notes, an explicit encode may be safer when the data then goes through the csv module. UTF-8 never produces spurious NUL bytes, so, assuming you want UTF-8 output (and usually one does), this approach should be fine.
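One more note on the asker's follow-up error: a UnicodeDecodeError raised by a call to .encode() means the string was already a byte string (the HumanName lines are commented out, so row['name'] comes straight from the file as UTF-8 bytes). In Python 2, calling .encode on a byte string first implicitly decodes it with the ASCII codec, which is exactly what chokes on the 0xcc byte. A sketch of the safe round trip, with hypothetical data:

```python
# In Python 2, some_bytes.encode('utf-8') implicitly runs
# some_bytes.decode('ascii') first - the source of the UnicodeDecodeError
# on byte 0xcc. The safe pattern: decode bytes to unicode explicitly
# before (re-)encoding.
raw = b"Counselors\xcc\xa9"   # hypothetical UTF-8 bytes as read from the file
text = raw.decode("utf-8")    # bytes -> unicode (explicit, not ASCII)
safe = text.encode("utf-8")   # unicode -> bytes: this cannot fail
print(safe == raw)
```

So either decode the fields when reading, or skip the encode for fields that are already byte strings.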

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow