DictReader and UnicodeError

Question 1

The Python 2 CSV module does not handle Unicode input.

Open the file in binary mode, and decode the values after parsing it as CSV. This is safe for the UTF-8 codec because newlines, delimiters and quotes each encode to a single byte, and those byte values never occur inside a multi-byte UTF-8 sequence.
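To illustrate why parsing the raw bytes first is safe for UTF-8, here is a minimal sketch. The sample text and variable names are made up for the demonstration, and it uses a plain `split` in place of the csv module so that it runs on both Python 2 and Python 3:

```python
# Sketch: byte-level splitting is safe for UTF-8 but not for every encoding.
text = u"caf\u00e9,na\u00efve,\u65e5\u672c\u8a9e"

# Split the UTF-8 bytes on the comma byte, then decode each field afterwards.
encoded = text.encode("utf-8")
fields = [field.decode("utf-8") for field in encoded.split(b",")]

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so none of them can
# be mistaken for an ASCII comma, quote, or newline.
non_ascii_bytes = u"\u00e9\u65e5".encode("utf-8")
all_high = all(b >= 0x80 for b in bytearray(non_ascii_bytes))

# Counter-example: in UTF-16-LE the character U+012C contains the byte 0x2C,
# which is the ASCII comma, so splitting on bytes would cut it in half.
utf16_has_comma_byte = b"," in u"\u012c".encode("utf-16-le")
```

The same reasoning does not carry over to encodings such as UTF-16, which is why the answer limits the claim to UTF-8.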

The csv module documentation includes a UnicodeReader wrapper class in the example section that will do the decoding for you; it is easily adapted to the DictReader class:

import csv

class UnicodeDictReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.encoding = encoding
        self.reader = csv.DictReader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        # Decode each value with the configured encoding, not a hard-coded one.
        return {k: unicode(v, self.encoding) for k, v in row.iteritems()}

    def __iter__(self):
        return self

Use this with the file opened in binary mode:

def openFile(fileName):
    try:
        # Binary mode: the csv module must see raw bytes, not decoded text.
        trainFile = open(fileName, "rb")
    except IOError as e:
        print "File could not be opened: {}".format(e)
    else:
        return UnicodeDictReader(trainFile)

Question 2

I can't comment on Martijn's answer, which works perfectly for me after a small change that I'm leaving here for others:

    def next(self):
        row = self.reader.next()
        try:
            d = dict((unicode(k, self.encoding), unicode(v, self.encoding)) for k, v in row.iteritems())
        except TypeError:
            d = row
        return d

First, Python 2.6 and lower don't support dict comprehensions. Second, a dict can hold values of different types, which unicode() with an encoding does not accept, so it's worth catching TypeError in case a value is None or a number. One more thing that drove me crazy: it doesn't work when you open the file with an encoding! Just use a plain open().
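As a possible refinement of the same idea, the decoding can be done per value instead of catching TypeError for the whole row, so that one None or number does not leave the rest of the row undecoded. This is only a sketch: `decode_value` and `decode_row` are hypothetical helper names, not part of the csv module, and the code is written to run on Python 2.6+ as well as Python 3:

```python
def decode_value(value, encoding="utf-8"):
    """Decode byte strings; pass None, numbers, and text through unchanged."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    if isinstance(value, list):
        # DictReader stores surplus fields as a list under its restkey.
        return [decode_value(item, encoding) for item in value]
    return value


def decode_row(row, encoding="utf-8"):
    # dict() with a generator also works on Python 2.6,
    # unlike a dict comprehension.
    return dict(
        (decode_value(key, encoding), decode_value(value, encoding))
        for key, value in row.items()
    )
```

Inside the wrapper class, next() could then return decode_row(self.reader.next(), self.encoding).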