Question

There are a number of topics on this problem around the web, but I cannot seem to find the answer for my specific case.

I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is a full Traceback:

Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)

With the help of Stack Overflow, I got it open with:

reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')

Now the problem is that when I am reading the file:

def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row

I get stuck on Asian symbols:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)

I have tried decode(), encode(), and unicode() on the string containing that character, but it does not seem to help.

for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row

At this point I do not really understand why. If I run

>>> print u'\u5a07'
娇

or

>>> print '娇'
娇

it works. It also works in the terminal. I have checked the default encoding in the terminal and in the Python shell; it is UTF-8 everywhere, and that symbol prints without trouble. I assume the problem has something to do with opening the file with codecs using UTF-16.

I am not sure where to go from here. Could anyone help out?


Solution

The Python 2 csv module cannot handle Unicode input. Its documentation page says so explicitly:

Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;

You need to convert your CSV file to UTF-8 so that the module can deal with it:

import codecs

# Re-encode the UTF-16 file as UTF-8, writing the copy next to the original.
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))

Alternatively, you can use the command-line utility iconv to convert the file for you.

Then use that re-coded file to read your data:

import csv

reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    # Each cell is a UTF-8 byte string; decode it to unicode.
    print [c.decode('utf8') for c in row]

Note that the columns then need decoding to unicode manually.
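
The two steps above (re-encode to UTF-8, then decode each cell) can also be done on the fly, without writing a second file. The following is a minimal sketch along the lines of the unicode_csv_reader recipe in the Python 2 csv documentation, not code from the original answer. The name read_utf16_csv is hypothetical, and the Python 3 branch is there only so that the same function runs on both major versions:

```python
import csv
import io
import sys


def read_utf16_csv(path, delimiter='\t', quotechar='"'):
    """Yield each row of a UTF-16 encoded CSV file as a list of unicode strings.

    read_utf16_csv is a hypothetical helper name, not part of the csv module.
    """
    # io.open decodes UTF-16 (honouring the BOM); newline='' leaves the line
    # endings for the csv module to interpret.
    with io.open(path, 'r', encoding='utf-16', newline='') as infile:
        if sys.version_info[0] >= 3:
            # Python 3: csv.reader accepts decoded text directly.
            reader = csv.reader(infile, delimiter=delimiter, quotechar=quotechar)
            for row in reader:
                yield row
        else:
            # Python 2: csv.reader only copes with byte strings, so hand it
            # UTF-8 encoded lines and decode every cell afterwards.
            utf8_lines = (line.encode('utf-8') for line in infile)
            reader = csv.reader(utf8_lines, delimiter=delimiter, quotechar=quotechar)
            for row in reader:
                yield [cell.decode('utf-8') for cell in row]
```

In Python 2, printing a whole row shows escapes such as u'\u5a07', because a list prints the repr of its items; printing a single cell shows the character itself.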

OTHER TIPS

Encode errors are what you get when you try to convert unicode characters to 8-bit sequences. So your first error is not one you get when actually reading the file, but a bit later.
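
To illustrate, the error from the traceback can be reproduced without any file at all. A minimal sketch (the variable names are only for illustration):

```python
char = u'\u5a07'  # the character named in the traceback

# Encoding to ASCII fails: ASCII has no byte for code point U+5A07.
try:
    char.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds and gives a three-byte sequence.
utf8_bytes = char.encode('utf-8')
```

This is the same conversion that happens implicitly when a unicode string is passed to code that expects a byte string: Python 2 falls back to the ascii codec unless told otherwise.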

You probably get this error because the Python 2 csv module expects files to be opened in binary mode, while you opened yours in a way that returns unicode strings.

Change your opening to this:

reader = csv.reader(open(file_full_path, 'rb'), delimiter='\t', quotechar='"')

And you should be fine. Or even better:

with open(file_full_path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    # CSV handling here.

However, this will not work with a UTF-16 (or UTF-32) file: the separator characters are two-byte characters in UTF-16, and the module will not handle them correctly, so you will need to convert the file to UTF-8 first.
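
As a rough illustration of why raw UTF-16 bytes confuse a byte-oriented parser, compare how the tab delimiter is encoded. This sketch assumes the little-endian form (utf-16-le), which is consistent with the 0xff first byte reported in the question, since a UTF-16 little-endian byte order mark begins with that byte. Big-endian swaps the byte order but has the same problem:

```python
# The tab delimiter as bytes in each encoding.
tab_utf8 = u'\t'.encode('utf-8')        # one byte: 0x09
tab_utf16 = u'\t'.encode('utf-16-le')   # two bytes: 0x09 0x00

# A short tab-separated line encoded as UTF-16 (little-endian, no BOM).
line_utf16 = u'a\tb'.encode('utf-16-le')

# Splitting on the single byte 0x09, as a byte-oriented parser would,
# leaves stray NUL bytes attached to both fields.
fields = line_utf16.split(b'\t')
```

The embedded NUL bytes are also the kind of input the csv documentation warns about, which is another reason to convert to UTF-8 before parsing.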

Licensed under: CC-BY-SA with attribution