Question

I have some files with Unicode data. The following code works fine on CPython to read those files, whereas on IronPython it crashes with "failed to decode bytes at index 67":

for f in self.list_of_files:
    all_words_in_file = []

    with codecs.open(f, encoding="utf-8-sig") as file_obj:
        for line in file_obj:
            all_words_in_file.extend(line.split(" "))

    #print "Normalising unicode strings"

    normal_list = []
    # gets all the words and removes duplicate words;
    # the list will contain unique normalized words
    for l in all_words_in_file:
        normal_list.append(normalize('NFKC', l))

    file_listing.update({f: normal_list})
return file_listing

I cannot understand the reason. Is there another way to read Unicode data in IronPython?


Solution

How about this one:

def lines(filename):
    with open(filename, "rb") as f:
        # the first line starts with the 3-byte UTF-8 BOM; skip it
        yield f.readline()[3:].strip().decode("utf-8")
        for line in f:
            yield line.strip().decode("utf-8")

for line in lines("text-utf8-with-bom.txt"):
    all_words_in_file.extend(line.split(" "))
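The hard-coded `[3:]` slice assumes every file begins with a BOM. A slightly more defensive sketch (my variation, not part of the answer above) strips the BOM only when it is actually present, using `codecs.BOM_UTF8`:

```python
import codecs

def lines(filename):
    with open(filename, "rb") as f:
        first = f.readline()
        # strip the 3-byte UTF-8 BOM only if the file really begins with one
        if first.startswith(codecs.BOM_UTF8):
            first = first[len(codecs.BOM_UTF8):]
        yield first.strip().decode("utf-8")
        for line in f:
            yield line.strip().decode("utf-8")
```

This way the same helper also handles files saved without a BOM.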

I have also filed an IronPython bug: https://ironpython.codeplex.com/workitem/34951

As long as you feed entire lines to decode, things will be OK.
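The reason whole lines matter is that UTF-8 is a multi-byte encoding: if a decode call lands in the middle of a character's byte sequence, it fails. A minimal illustration:

```python
data = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9' -- the e-acute is two bytes
print(data.decode("utf-8"))         # the complete sequence decodes fine

try:
    data[:4].decode("utf-8")        # cuts the two-byte character in half
except UnicodeDecodeError as e:
    print("partial decode failed:", e.reason)
```

Line boundaries are safe split points because the newline byte never occurs inside a multi-byte UTF-8 sequence.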

License: CC-BY-SA with attribution