Question

I have a UTF-8 encoded text file and I want to tokenize each line, using split() as a simple tokenizer. The code looks like this:

import codecs
f = codecs.open(fileAddress, 'r', 'utf-8')  # renamed to avoid shadowing the built-in 'file'
line = f.readline()
f.close()
tokens = line.split()

This doesn't split the UTF-8 string the way it does on ASCII files. I want a line like "hi i am here", which is in UTF-8 encoding, to become a list of tokens like ["hi", "i", "am", "here"], which is easy with ASCII using line.split().

Is there any simple solution to this problem?

Solution

As pointed out by Martijn Pieters, your code should work fine, provided your file uses regular spaces as separators. The only difference from the result you are expecting is that the tokens will be of type unicode rather than str.
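A quick self-contained check of that claim (the file path and contents here are invented for illustration; under Python 3 the tokens come back as str, which is what Python 2 called unicode):

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file to split, standing in for the asker's file.
fd, path = tempfile.mkstemp()
os.close(fd)
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(u'hi i am here\n')

# Read it back the same way the question does and split on whitespace.
with codecs.open(path, 'r', 'utf-8') as f:
    line = f.readline()
os.remove(path)

tokens = line.split()
print(tokens)  # ['hi', 'i', 'am', 'here']
```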

There are other Unicode characters used to represent whitespace (see http://en.wikipedia.org/wiki/Whitespace_character#Unicode), and one of those may be causing the mess. Even readline() can be problematic in that case, since some of those characters act as line separators.
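To illustrate, here is a sketch with a made-up line: Python's split() on a decoded string already treats most Unicode whitespace (such as EM SPACE, U+2003) as a separator, but characters like ZERO WIDTH SPACE (U+200B) are not whitespace and need to be normalized first. The sample characters are assumptions for demonstration, not from the original question:

```python
# EM SPACE (U+2003) and ZERO WIDTH SPACE (U+200B) used as "separators".
line = u'hi\u2003i\u200bam here'

# split() handles U+2003, but U+200B survives inside a token.
print(line.split())  # ['hi', 'i\u200bam', 'here']

# Normalize known non-whitespace troublemakers before splitting.
tokens = line.replace(u'\u200b', u' ').split()
print(tokens)  # ['hi', 'i', 'am', 'here']
```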

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow