Python CSV file UTF-16 to UTF-8 print error

Question 1

The Python 2 csv module cannot handle Unicode input. Its documentation page says so explicitly:

Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;

You need to convert your CSV file to UTF-8 so that the module can deal with it:

import codecs

# codecs.open decodes the UTF-16 input; each line is re-encoded as UTF-8.
with codecs.open(file_full_path, 'r', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))

Alternatively, you can use the command-line utility iconv to convert the file for you.

Then use that re-encoded file to read your data:

import csv

with open(file_full_path + '.utf8', 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    for row in reader:
        print [c.decode('utf8') for c in row]

Note that the csv module returns byte strings, so each column then has to be decoded to unicode manually.
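For readers on Python 3, where the csv module works on text rather than bytes, neither the intermediate file nor the manual decoding should be necessary. The following is a minimal sketch under that assumption; the function name read_utf16_tsv is hypothetical and not part of the original question, and the code is not meant for Python 2.

```python
import csv
import io


def read_utf16_tsv(path):
    """Return all rows of a tab-separated UTF-16 file as lists of str.

    Hypothetical helper for illustration; assumes Python 3.
    """
    # newline='' leaves line-ending handling to the csv module.
    with io.open(path, 'r', encoding='utf-16', newline='') as infile:
        return list(csv.reader(infile, delimiter='\t', quotechar='"'))
```

Calling read_utf16_tsv(file_full_path) would then give rows whose cells are already decoded text.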

Question 2

Encode errors are what you get when you try to convert unicode characters to 8-bit sequences. So your first error is not one you get when actually reading the file, but a bit later.
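To illustrate where such an error comes from, here is a small sketch that is independent of the csv module. The sample string is made up for the example. Printing a unicode string in Python 2 can trigger the same implicit encoding step when the output stream uses ASCII, though whether that is what happens in your case is an assumption.

```python
text = u'caf\u00e9'  # made-up sample containing one non-ASCII character

try:
    text.encode('ascii')  # ASCII has no byte for U+00E9
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# UTF-8 can represent the same character, using two bytes.
utf8_bytes = text.encode('utf-8')
```

Here the ASCII encode raises UnicodeEncodeError, while the UTF-8 encode succeeds.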

You probably get this error because the Python 2 csv module expects the file to be opened in binary mode, so that it yields byte strings, whereas you opened it in a way that returns unicode strings.

Change the way you open the file to this:

reader = csv.reader(open(file_full_path, 'rb'), delimiter='\t', quotechar='"')

And you should be fine. Or even better:

with open(file_full_path, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    # CSV handling here.

However, you can't feed UTF-16 (or UTF-32) to the csv module directly: in those encodings even the separator characters take two (or four) bytes each, and the module will not handle this correctly. You will need to convert the file to UTF-8 first.
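The byte-level difference can be checked directly. This is only a sketch with a made-up sample line, using the little-endian variant of UTF-16 without a byte order mark:

```python
# A tab is one byte in UTF-8 but two bytes in UTF-16 (little-endian here).
tab_utf8 = u'\t'.encode('utf-8')
tab_utf16 = u'\t'.encode('utf-16-le')

# Made-up sample line with two fields separated by a tab.
line_utf16 = u'a\tb\n'.encode('utf-16-le')

# Splitting the raw bytes on a single-byte tab leaves stray NUL bytes
# attached to the fields, which is why byte-oriented parsing goes wrong.
naive_fields = line_utf16.split(b'\t')
```

After splitting, the first field still ends in a NUL byte and the second one starts with one, so the fields are no longer valid UTF-16 on their own.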