Question

I have some files containing Unicode data. The following code reads them fine under CPython, but on IronPython it crashes with "failed to decode bytes at index 67":

import codecs
from unicodedata import normalize

def build_listing(self):
    file_listing = {}
    for f in self.list_of_files:
        all_words_in_file = []

        with codecs.open(f, encoding="utf-8-sig") as file_obj:
            for line in file_obj:
                all_words_in_file.extend(line.split(" "))

        # normalise every word so equivalent Unicode
        # spellings compare equal
        normal_list = []
        for word in all_words_in_file:
            normal_list.append(normalize('NFKC', word))

        file_listing.update({f: normal_list})
    return file_listing

I cannot understand the reason. Is there another way to read Unicode data in IronPython?


Solution

How about this one:

def lines(filename):
    # Read raw bytes and decode each complete line by hand,
    # bypassing the codec machinery that trips up IronPython.
    f = open(filename, "rb")
    # The first line carries the 3-byte UTF-8 BOM; skip it.
    yield f.readline()[3:].strip().decode("utf-8")
    for line in f:
        yield line.strip().decode("utf-8")
    f.close()

all_words_in_file = []
for line in lines("text-utf8-with-bom.txt"):
    all_words_in_file.extend(line.split(" "))
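If you want to slot this into the method from the question, something like the sketch below should work (the method name is made up here, and self.list_of_files is assumed from the question's code):

import unicodedata

def build_file_listing(self):
    # Same shape as the question's method, but reading through
    # the lines() generator instead of codecs.open().
    file_listing = {}
    for f in self.list_of_files:
        all_words_in_file = []
        for line in lines(f):
            all_words_in_file.extend(line.split(" "))
        # NFKC-normalise every word, as in the original code
        file_listing[f] = [unicodedata.normalize('NFKC', w)
                           for w in all_words_in_file]
    return file_listing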

I have also filed an IronPython bug: https://ironpython.codeplex.com/workitem/34951

As long as you feed entire lines to decode(), things will be OK: a line break can never fall in the middle of a multi-byte UTF-8 sequence, so each line is a complete, decodable byte string.
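A minimal sketch that demonstrates this (Python 3 syntax; the sample text is made up):

# A newline byte (0x0A) never occurs inside a multi-byte UTF-8
# sequence, so cutting at line boundaries cannot split a character,
# while cutting at an arbitrary byte offset can.
data = "héllo\nwörld\n".encode("utf-8")

try:
    data[:2].decode("utf-8")           # slices the two-byte "é" in half
except UnicodeDecodeError as exc:
    print("partial chunk fails:", exc)

for raw in data.splitlines():          # whole lines always decode cleanly
    print(raw.decode("utf-8"))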

Licensed under: CC-BY-SA with attribution