Parsing a CSV file with English and Hindi characters in Python

https://stackoverflow.com/questions/17661093

03-06-2022

Question

I am trying to parse a CSV file that has both English and Hindi characters, and I am using UTF-16. It works fine, but as soon as it hits a Hindi character it fails. I am at a loss here.

Here's the code:

import csv
import codecs

csvReader = csv.reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
    print row

The error that I get is:

> Traceback (most recent call last):
>   File "csvreader.py", line 8, in <module>
>     for row in csvReader:
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-18: ordinal not in range(128)
> kuberkaul@ubuntu:~/Desktop$

How do I solve this?

Edit 1:

I tried the suggested solutions and used the unicode_csv_reader wrapper, and now it gives this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

The code is:

import csv
import codecs, io


def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = '/home/kuberkaul/Downloads/csv.csv'
reader = unicode_csv_reader(codecs.open(filename))
print reader
for rows in reader:
    print rows

Solution

As the Python 2 csv module documentation says, in a big Note near the top:

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.

If you follow the link to the examples, it shows you the solution: encode each line to UTF-8 before passing it to csv. They even give you a nice wrapper, so you can just replace csv.reader with unicode_csv_reader and leave the rest of your code unchanged:

csvReader = unicode_csv_reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
    print row
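
For comparison, on Python 3 the csv module works on Unicode text directly, so the UTF-8 round trip is not needed there. The following is a minimal sketch under that assumption, not the method from the documentation quoted above; the helper name read_utf16_csv and the sample data are hypothetical, made up for illustration.

```python
import csv
import io
import os
import tempfile


def read_utf16_csv(path):
    """Return all rows of a UTF-16 CSV file as lists of str (Python 3)."""
    # newline='' leaves line-ending handling to the csv module.
    with io.open(path, 'r', encoding='utf-16', newline='') as f:
        return list(csv.reader(f))


# Build a small sample file (hypothetical data) so the sketch is self-contained.
sample = 'name,greeting\r\nkuber,\u0928\u092e\u0938\u094d\u0924\u0947\r\n'
fd, sample_path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-16'))  # encode() prepends a byte order mark

rows = read_utf16_csv(sample_path)
os.remove(sample_path)

print(len(rows))
```

In both versions the encoding is passed explicitly when the file is opened. A UTF-16 file normally starts with a byte order mark whose first byte is 0xff or 0xfe, which is consistent with the "byte 0xff in position 0" error in Edit 1, where codecs.open was called without an encoding.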

Of course the print isn't going to be very useful, as the str of a list uses the repr of each element, so you're going to get something like [u'foo', u'bar', u'\u0910\u0911']…
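
The difference between printing the list and printing its elements can be seen in a small sketch. It assumes Python 3, where repr no longer escapes printable non-ASCII characters, so ascii() is used to reproduce the escaped form that Python 2 shows (minus the u prefix); the sample row is hypothetical.

```python
row = ['foo', 'bar', '\u0910\u0911']  # hypothetical sample row

# Converting the list itself goes through the repr of each element,
# so the result carries brackets and quotes.
as_list = str(row)

# ascii() additionally escapes non-ASCII characters, similar to what
# Python 2 prints for a list of unicode strings.
escaped = ascii(row)

# Joining the elements yields plain text with no quoting or escaping.
joined = ', '.join(row)

print(escaped)
```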

You can fix that in the usual ways. For example, print u', '.join(row) will work if you remember the u prefix and if Python is able to guess your terminal's encoding (which it can on Mac and modern Linux, but may not be able to on Windows and old Linux, in which case you'll need to map an explicit encode over each column).
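
Mapping an explicit encode over each column might look like the sketch below. It assumes Python 3 syntax and a terminal that accepts UTF-8; the helper name encode_row and the choice of UTF-8 are assumptions for illustration, not part of the original answer.

```python
def encode_row(row, encoding='utf-8'):
    """Encode every column explicitly and join the results as bytes."""
    return b', '.join(col.encode(encoding) for col in row)


row = ['foo', 'bar', '\u0910\u0911']  # hypothetical sample row
line = encode_row(row)
```

Because line is already bytes, it can be written with sys.stdout.buffer.write(line + b'\n') on Python 3, so Python never has to guess the terminal's encoding.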

Licensed under: CC-BY-SA with attribution
