Python3.0: tokenize & BytesIO
Question
When tokenizing a string in Python 3.0, why do I get a leading 'utf-8' token before the real tokens start?
According to the Python 3 docs, tokenize should now be used as follows:
g = tokenize(BytesIO(s.encode('utf-8')).readline)
However, when attempting this at the terminal, the following happens:
>>> from tokenize import tokenize
>>> from io import BytesIO
>>> g = tokenize(BytesIO('foo'.encode()).readline)
>>> next(g)
(57, 'utf-8', (0, 0), (0, 0), '')
>>> next(g)
(1, 'foo', (1, 0), (1, 3), 'foo')
>>> next(g)
(0, '', (2, 0), (2, 0), '')
>>> next(g)
Traceback (most recent call last):
  ...
StopIteration
What's with the 'utf-8' token that precedes the others? Is this supposed to happen? If so, should I just always skip the first token?
[edit]
I have found that token type 57 is tokenize.ENCODING, which can easily be filtered out of the token stream if need be.
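As a small sketch of that filtering approach (written against the modern tokenize API, where tokens are named tuples with `.type` and `.string` attributes), you can wrap the generator and drop the ENCODING token:

```python
from io import BytesIO
from tokenize import tokenize, ENCODING

def tokens_without_encoding(source):
    """Tokenize a source string, skipping the leading ENCODING token."""
    for tok in tokenize(BytesIO(source.encode('utf-8')).readline):
        if tok.type != ENCODING:
            yield tok

toks = list(tokens_without_encoding('foo'))
# The first remaining token is now the NAME token for 'foo'.
```

Comparing against `tok.type` is more robust than hard-coding 57, since token type numbers vary between Python versions.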
Solution
That token reports the coding cookie of the source. You can specify one explicitly:
# -*- coding: utf-8 -*-
do_it()
Otherwise Python assumes the default encoding, which is utf-8 in Python 3.
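To illustrate the connection: if the source declares a coding cookie, the ENCODING token reports it instead of the default. A sketch (on a recent Python, where tokenize normalizes 'latin-1' to 'iso-8859-1'):

```python
from io import BytesIO
from tokenize import tokenize, ENCODING

# A source with an explicit coding cookie on its first line.
source = b"# -*- coding: latin-1 -*-\nx = 1\n"

# The very first token is ENCODING, carrying the detected encoding.
first = next(tokenize(BytesIO(source).readline))
```

Here `first.type` is `ENCODING` and `first.string` names the declared encoding rather than 'utf-8'.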
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow