Question

def openFile(fileName):
    try:
       trainFile  = io.open(fileName,"r",encoding = "utf-8")
    except IOError as e:
       print ("File could not be opened: {}".format(e))
    else:
       trainData = csv.DictReader(trainFile)
       print trainData
       return trainData

def computeTFIDF(trainData):
     bodyList = []
     print "Inside computeTFIDF"
     for row in trainData:
        for key, value in row.iteritems():
             print key, unicode(value, "utf-8", "ignore")
     print "Done"
     return

if __name__ == "__main__":
    print "Main"
    trainData = openFile("../Data/TrainSample.csv")
    print "File Opened"
    computeTFIDF(trainData)

Error:

Traceback (most recent call last):
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 62, in <module>
    computeTFIDF(trainData)
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 42, in computeTFIDF
    for row in trainData:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 215: ordinal not in range(128)

TrainSample.csv is a CSV file with 4 columns (with a header row).
OS: Windows 7 64 bit.
Using Python 2.x

I don't know what is going wrong here. I told it to ignore encoding errors, but it still throws the same error.

I think the error is thrown before control even reaches the decoding step.

Can anybody tell me where I am going wrong?


Solution

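The Python 2 csv module does not handle Unicode input. Because io.open() with an encoding yields unicode lines, the reader tries to convert them to byte strings with the default ASCII codec and fails on the first non-ASCII character (here u'\u201c'). That happens inside the for loop, before your own unicode(value, "utf-8", "ignore") call ever runs.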
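The failure can be reproduced in isolation. This is a minimal sketch, assuming the implicit conversion behaves like an explicit encode with the ASCII codec; the sample string and variable names are made up for illustration:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text containing the curly quote from the traceback.
text = u"\u201cquoted\u201d"

try:
    # Roughly what an implicit unicode-to-bytes conversion does in Python 2.
    text.encode("ascii")
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# Encoding explicitly as UTF-8 works and round-trips cleanly.
raw = text.encode("utf-8")
round_tripped = raw.decode("utf-8")
```

After this runs, failed is True and message names the 'ascii' codec, just like the traceback in the question.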

Open the file in binary mode, and decode after parsing it as CSV. This is safe for the UTF-8 codec as newlines, delimiters and quotes all encode to 1 byte.
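To illustrate why parsing the bytes first is safe, here is a minimal sketch (the sample line is made up). Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for an ASCII comma, quote or newline:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample line containing the curly quotes from the traceback.
line = u"1,\u201cHello\u201d,caf\u00e9\n"
raw = line.encode("utf-8")

# Every byte that belongs to a non-ASCII character is >= 0x80,
# so none of them can collide with b",", b'"' or b"\n".
non_ascii_bytes = bytearray(u"\u201c\u201d\u00e9".encode("utf-8"))
all_high = all(b >= 0x80 for b in non_ascii_bytes)

# Splitting the bytes and then decoding gives the same fields
# as splitting the already-decoded text.
fields_from_bytes = [f.decode("utf-8") for f in raw.rstrip(b"\n").split(b",")]
fields_from_text = line.rstrip(u"\n").split(u",")
```

This only shows the byte-level property; real CSV parsing (quoting, embedded newlines) should still be left to the csv module.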

The csv module documentation includes a UnicodeReader wrapper class in the example section that will do the decoding for you; it is easily adapted to the DictReader class:

import csv

class UnicodeDictReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.encoding = encoding
        self.reader = csv.DictReader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        # Decode each value with the configured encoding, not a hard-coded "utf-8".
        return {k: unicode(v, self.encoding) for k, v in row.iteritems()}

    def __iter__(self):
        return self

Use this with the file opened in binary mode:

def openFile(fileName):
    try: 
        trainFile  = open(fileName, "rb")
    except IOError as e:
        print "File could not be opened: {}".format(e)
    else:
        return UnicodeDictReader(trainFile)

OTHER TIPS

I can't comment on Martijn's answer, which works perfectly for me after a small modification that I'm leaving here for others:

    def next(self):
        row = self.reader.next()
        try:
            d = dict((unicode(k, self.encoding), unicode(v, self.encoding)) for k, v in row.iteritems())
        except TypeError:
            d = row
        return d

One reason for the change is that Python 2.6 and lower do not support dict comprehensions. Another is that a row can contain values of other types, which unicode() cannot decode, so it is worth catching TypeError in case a value is None or a number. One more thing that drove me crazy: it does not work when you open the file with an encoding! Just use a plain open().
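One caveat with catching TypeError around the whole row is that a single None value leaves every field in that row undecoded. A possible variation, sketched here with hypothetical helper names (decode_value and decode_row are not part of the csv module), decodes each value on its own and passes anything that is not a byte string through unchanged:

```python
# -*- coding: utf-8 -*-

def decode_value(value, encoding="utf-8"):
    """Decode byte strings; return None, numbers and text unchanged."""
    if isinstance(value, bytes):  # bytes is an alias for str in Python 2.6+
        return value.decode(encoding)
    return value


def decode_row(row, encoding="utf-8"):
    """Return a copy of a DictReader-style row with keys and values decoded."""
    return dict(
        (decode_value(k, encoding), decode_value(v, encoding))
        for k, v in row.items()
    )


# Hypothetical row: one UTF-8 value, one missing value, one number.
sample = {b"Title": b"\xe2\x80\x9cHi\xe2\x80\x9d", b"Body": None, b"Id": 7}
decoded = decode_row(sample)
```

Inside the reader class, next() could then return decode_row(row, self.encoding). Lists produced by the DictReader restkey option would still need separate handling.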

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow