Question

The following sample code:

import token, tokenize, StringIO

def generate_tokens(src):
    rawstr = StringIO.StringIO(unicode(src))
    tokens = tokenize.generate_tokens(rawstr.readline)
    for i, item in enumerate(tokens):
        toktype, toktext, (srow,scol), (erow,ecol), line = item
        print i, token.tok_name[toktype], toktext

s = \
"""
 def test(x):
     \"\"\" test with an unterminated docstring
"""

generate_tokens(s)

causes the following to fire:

... (stripped a little)
File "/usr/lib/python2.6/tokenize.py", line 296, in generate_tokens
    raise TokenError, ("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (3, 5))

Some questions about this behaviour:

  1. Should I catch and 'selectively' ignore tokenize.TokenError here? Or should I stop trying to generate tokens from non-compliant/non-complete code? If so, how would I check for that?
  2. Can this error (or similar errors) be caused by anything other than an unterminated docstring?

Solution

How you handle tokenize errors depends entirely on why you are tokenizing. Your code gives you all the valid tokens up to the beginning of the bad string literal. If that token stream is useful to you, then use it.

You have a few options about what to do with the error:

  1. You could ignore it and have an incomplete token stream.

  2. You could buffer all the tokens and only use the token stream if no error occurred.

  3. You could process the tokens, but abort the higher-level processing if an error occurred.
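As a rough sketch of how these three options might look, the helper below collects tokens until the tokenizer gives up and hands back both the partial list and the error, if any. It is written for Python 3 (unlike the Python 2 code in the question), and the name collect_tokens and its return shape are assumptions made for illustration, not part of the tokenize module.

```python
import io
import tokenize


def collect_tokens(src):
    """Return (tokens, error); error is None if src tokenized cleanly.

    Hypothetical helper, not part of the tokenize module.
    """
    tokens = []
    error = None
    readline = io.StringIO(src).readline
    try:
        for tok in tokenize.generate_tokens(readline):
            tokens.append(tok)
    except tokenize.TokenError as exc:
        # Everything yielded before the failure is still in `tokens`.
        error = exc
    return tokens, error


src = 'x = 1\ny = """never closed\n'
tokens, error = collect_tokens(src)

# Option 1: ignore the error and work with the partial stream.
partial = tokens

# Option 2: use the stream only if tokenizing succeeded.
complete = tokens if error is None else None

# Option 3: process what arrived, then let the caller decide whether to abort.
names = [t.string for t in tokens if t.type == tokenize.NAME]
should_abort = error is not None
```

Because the generator is lazy, the tokens that precede the bad literal have already been appended by the time the exception arrives, which is what makes options 1 and 3 possible.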

As to whether that error can happen with anything other than an incomplete docstring: yes. Docstrings are just string literals, so any unterminated multi-line string literal will give you the same error. Other lexical errors in the code can raise similar errors.
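To illustrate that nothing here is specific to docstrings, the sketch below feeds the tokenizer an ordinary assignment whose triple-quoted string is never closed, next to a docstring like the one in the question. It assumes Python 3, and first_error is a hypothetical helper name. The exact message and position are left to the interpreter to print, since they may vary between versions.

```python
import io
import tokenize


def first_error(src):
    """Return the TokenError raised while tokenizing src, or None."""
    try:
        for _ in tokenize.generate_tokens(io.StringIO(src).readline):
            pass
    except tokenize.TokenError as exc:
        return exc
    return None


# A plain assignment, not a docstring, with an unterminated ''' literal.
assignment = "data = '''never closed\n"
# The same construct used as a docstring, as in the question.
docstring = 'def test(x):\n    """never closed\n'

print(first_error(assignment))
print(first_error(docstring))
```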

For example, here are other values of s that produce errors (at least with Python 2.5):

s = ")"  # EOF in multi-line statement
s = "("  # EOF in multi-line statement
s = "]"  # EOF in multi-line statement
s = "["  # EOF in multi-line statement
s = "}"  # EOF in multi-line statement
s = "{"  # EOF in multi-line statement

Oddly, other nonsensical inputs produce ERRORTOKEN values instead:

s = "$"
s = "'"
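If you want to check how your own interpreter treats these inputs, a small probe along the following lines may help. It is a sketch for Python 3 with a hypothetical helper named probe. The results can differ between interpreter versions (newer releases report some inputs as exceptions where older ones yield ERRORTOKEN), so no expected output is shown.

```python
import io
import tokenize


def probe(src):
    """Classify how the tokenizer reacts to src (hypothetical helper)."""
    try:
        toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
    except (tokenize.TokenError, SyntaxError) as exc:
        # SyntaxError is caught as well because tokenize can raise
        # IndentationError (a SyntaxError subclass) for bad dedents.
        return "%s: %s" % (type(exc).__name__, exc.args[0])
    if any(t.type == tokenize.ERRORTOKEN for t in toks):
        return "ERRORTOKEN"
    return "ok"


for sample in [")", "(", "]", "[", "}", "{", "$", "'"]:
    print(repr(sample), "->", probe(sample))
```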
Licensed under: CC-BY-SA with attribution