How to handle a tokenize error with unterminated multiline comments (python 2.6)
08-07-2019
Question
The following sample code:
import token, tokenize, StringIO

def generate_tokens(src):
    rawstr = StringIO.StringIO(unicode(src))
    tokens = tokenize.generate_tokens(rawstr.readline)
    for i, item in enumerate(tokens):
        toktype, toktext, (srow, scol), (erow, ecol), line = item
        print i, token.tok_name[toktype], toktext
s = \
"""
def test(x):
    \"\"\" test with an unterminated docstring
"""

generate_tokens(s)
raises the following exception:
... (stripped a little)
File "/usr/lib/python2.6/tokenize.py", line 296, in generate_tokens
raise TokenError, ("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (3, 5))
Some questions about this behaviour:
- Should I catch and 'selectively' ignore tokenize.TokenError here? Or should I stop trying to generate tokens from non-compliant/non-complete code? If so, how would I check for that?
- Can this error (or similar errors) be caused by anything other than an unterminated docstring?
Solution
How you handle tokenize errors depends entirely on why you are tokenizing. Your code gives you all the valid tokens up to the beginning of the bad string literal. If that token stream is useful to you, then use it.
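If the partial stream is enough, a small wrapper can swallow the `TokenError` and simply stop at the point of failure. A minimal sketch, written against the Python 3 stdlib (`io.StringIO` instead of `StringIO.StringIO`) so it is easy to try; the Python 2.6 version has the same shape:

```python
import io
import tokenize

def tolerant_tokens(src):
    """Yield tokens up to the first TokenError, then stop quietly."""
    gen = tokenize.generate_tokens(io.StringIO(src).readline)
    try:
        for tok in gen:
            yield tok
    except tokenize.TokenError:
        return  # incomplete source: keep whatever we already got

# The unterminated docstring no longer propagates an exception;
# we still get the tokens for the complete "def" line.
toks = list(tolerant_tokens('def test(x):\n    """ unterminated\n'))
```

Whether silently truncating is acceptable depends on the caller: it is fine for best-effort tools like syntax highlighters, and wrong for anything that must see the whole file.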
You have a few options about what to do with the error:
- You could ignore it and have an incomplete token stream.
- You could buffer all the tokens and only use the token stream if no error occurred.
- You could process the tokens, but abort the higher-level processing if an error occurred.
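The second option (all-or-nothing buffering) can be sketched as follows, again using the Python 3 stdlib spelling for convenience:

```python
import io
import tokenize

def all_tokens_or_none(src):
    """Buffer every token; return None if tokenization did not complete."""
    buffered = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(src).readline):
            buffered.append(tok)
    except tokenize.TokenError:
        return None  # incomplete/broken source: discard the partial stream
    return buffered

all_tokens_or_none('y = """oops\n')  # unterminated string -> None
all_tokens_or_none('y = 1\n')        # complete source -> full token list
```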
As to whether that error can happen with anything other than an incomplete docstring, yes. Remember that docstrings are just string literals. Any unterminated multi-line string literal will give you the same error. Similar errors could happen for other lexical errors in the code.
For example, here are other values of s that produce errors (at least with Python 2.5):
s = ")" # EOF in multi-line statement
s = "(" # EOF in multi-line statement
s = "]" # EOF in multi-line statement
s = "[" # EOF in multi-line statement
s = "}" # EOF in multi-line statement
s = "{" # EOF in multi-line statement
Oddly, other nonsensical inputs produce ERRORTOKEN values instead:
s = "$"
s = "'"
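The two failure modes can be observed side by side with a helper that records token names and any error. The sketch below uses the Python 3 stdlib; note that the tokenizer was reimplemented in 3.12, so stray characters that older versions reported in-band as ERRORTOKEN entries may instead raise on newer interpreters:

```python
import io
import token
import tokenize

def scan(src):
    """Tokenize src, returning (token_names, error_message_or_None)."""
    names = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(src).readline):
            names.append(token.tok_name[tok.type])
    except (tokenize.TokenError, SyntaxError) as exc:
        # Unterminated strings/brackets abort the stream with TokenError;
        # some invalid inputs on newer Pythons surface as SyntaxError.
        return names, str(exc)
    return names, None

# Unterminated string: the generator raises, cutting the stream short,
# but the tokens from the complete first line are still delivered.
print(scan('x = 1\ny = """oops\n'))
# Stray "$": historically reported in-band as an ERRORTOKEN entry
# (newer tokenizers may raise instead).
print(scan('$\n'))
```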