Question

When attempting to tokenize a string in Python 3.0, why do I get a leading 'utf-8' token before the real tokens start?

According to the Python 3 docs, tokenize should now be used as follows:

g = tokenize(BytesIO(s.encode('utf-8')).readline)

However, when I try this in the interactive interpreter, the following happens:

>>> from tokenize import tokenize
>>> from io import BytesIO
>>> g = tokenize(BytesIO('foo'.encode()).readline)
>>> next(g)
(57, 'utf-8', (0, 0), (0, 0), '')
>>> next(g)
(1, 'foo', (1, 0), (1, 3), 'foo')
>>> next(g)
(0, '', (2, 0), (2, 0), '')
>>> next(g)

What's with the utf-8 token that precedes the others? Is this supposed to happen? If so, should I just always skip the first token?

[edit]

I have found that token type 57 is tokenize.ENCODING, which can easily be filtered out of the token stream if need be.
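A minimal sketch of such a filter, comparing on tokenize.ENCODING rather than the numeric value 57 (which can differ between Python versions); the helper name is just for illustration:

```python
from io import BytesIO
from tokenize import tokenize, ENCODING

def tokens_without_encoding(source):
    # tokenize() always emits an ENCODING token first; drop it.
    # (Hypothetical helper name, purely illustrative.)
    for tok in tokenize(BytesIO(source.encode('utf-8')).readline):
        if tok.type != ENCODING:
            yield tok

strings = [tok.string for tok in tokens_without_encoding('foo')]
```

After filtering, the first token for the source `'foo'` is the NAME token 'foo' rather than the encoding marker.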

Was it helpful?

Solution

That's the coding cookie of the source: tokenize reports the encoding it detected as the very first token. You can specify one explicitly:

# -*- coding: utf-8 -*-
do_it()

Otherwise Python assumes the default source encoding, which is UTF-8 in Python 3.
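For instance, declaring a cookie in the source changes what that first ENCODING token reports (a small sketch; latin-1 is an arbitrary choice here, and CPython's tokenize normalizes it to its canonical name):

```python
from io import BytesIO
from tokenize import tokenize, ENCODING

# A source with an explicit coding cookie on its first line.
src = b"# -*- coding: latin-1 -*-\ndo_it()\n"

first = next(tokenize(BytesIO(src).readline))
# first.type is ENCODING; first.string names the detected encoding
# ('iso-8859-1', the normalized form of latin-1).
```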

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow