Question

I have a UTF-8 encoded text file and I want to tokenize each line, using split() as a simple tokenizer. The code looks like this:

import codecs

# Decode the file as UTF-8 while reading, so readline() returns a unicode line.
file = codecs.open(fileAddress, 'r', 'utf-8')
line = file.readline()
file.close()
line.split()

This doesn't split the UTF-8 string the way it does on ASCII files. I want a line like "hi i am here", stored in UTF-8 encoding, to become a list of tokens like ["hi", "i", "am", "here"], which is easy with ASCII using line.split().

Is there any simple solution to this problem?


Solution

As pointed out by Martijn Pieters, your code should work fine, provided your file uses regular spaces as separators. The only difference from the result you are expecting is that the tokens will be of type unicode rather than str.
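
For example, here is a minimal sketch (assuming Python 2, as the codecs/unicode usage in your question suggests, and a hypothetical example.txt containing the line "hi i am here") showing that split() already works on the decoded line; only the token type differs:

import codecs

# Hypothetical file name for illustration; assumed to contain "hi i am here".
with codecs.open('example.txt', 'r', 'utf-8') as f:
    line = f.readline()

tokens = line.split()       # splits on runs of whitespace, just as with str
print(tokens)               # [u'hi', u'i', u'am', u'here']
print(type(tokens[0]))      # <type 'unicode'> on Python 2, not <type 'str'>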

There are other Unicode characters used to represent whitespace (http://en.wikipedia.org/wiki/Whitespace_character#Unicode); maybe one of those is causing the mess. Even readline() can be problematic if that is the case...
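
As an illustration (the ZERO WIDTH SPACE used below is just an assumed example of such a character), a line whose words are separated by something that split() does not count as whitespace on current Python versions stays in one piece; naming the characters helps spot the culprit, and normalising it to a plain space fixes the split:

import unicodedata

# Assumed example: words separated by U+200B ZERO WIDTH SPACE instead of a
# normal space; split() does not treat this character as whitespace.
line = u'hi\u200bi\u200bam\u200bhere'
print(line.split())                            # one single token, not four

# Name every character to spot the unusual separator.
print([unicodedata.name(ch, 'UNKNOWN') for ch in line])

# Normalise the odd separator to a plain space, then split as usual.
print(line.replace(u'\u200b', u' ').split())   # [u'hi', u'i', u'am', u'here']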

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow