Question

I am writing a parser using PLY. My question is similar to "How to write a regular expression to match a string literal where the escape is a doubling of the quote character?". However, I use a double quote to open and close a string, and a backslash to escape a double quote inside it. For example:

"I do not know what \"A\" is"

I define the normal string lexer as:

t_NORMSTRING = r'"([^"\n]|(\\"))*"$'

and I have another lexer for a variable:

def t_VAR(t):
   r'[a-zA-Z_][a-zA-Z_0-9]*'

The problem is that my lexer doesn't recognize "I do not know what \"A\" is" as a NORMSTRING token. It reports this error:

Illegal character '"' at 1
Syntax error at 'LexToken(VAR,'do',10,210)'

Please let me know why it is not correct.

Solution

Having explored this with a small PLY program, I think the problem lies in how the input data reaches the lexer (raw versus non-raw strings), and not in PLY's parsing or lexical matching itself. (As a side note, Python 2 and Python 3 differ slightly in this area of string handling; I have restricted my code to Python 2.)

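You only get the error you are seeing if you use a non-raw string literal, or if you read the line with input instead of raw_input. (In Python 2, input evaluates the typed text as a Python expression, so the escapes are processed just as in a non-raw literal.) My example code and its results show this: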

Commands:

$ python --version
Python 2.7.5
$ python string.py

Code:

import sys

if ".." not in sys.path: sys.path.insert(0,"..")
import ply.lex as lex
tokens = (
    'NORMSTRING',
    'VAR'
)

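def t_NORMSTRING(t):
    r'"([^"\n]|(\\"))*"$'
    # No "return t": PLY discards the token, so only this print shows the match.
    print "String: '%s'" % t.value

def t_VAR(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # No "return t": VAR tokens are matched and silently discarded.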

t_ignore = ' \t\r\n'

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()

data = r'"I do not know what \"A\" is"'

print "Data: '%s'" % data

lexer.input(data)

while True:
   tok = lexer.token()
   if not tok: break
   print tok

Output:

Data: '"I do not know what \"A\" is"'
String: '"I do not know what \"A\" is"'

The same program with a non-raw string:

data = '"I do not know what \"A\" is"'

print "Data: '%s'" % data

lexer.input(data)

while True:
   tok = lexer.token()
   if not tok: break
   print tok

Output:

Data: '"I do not know what "A" is"'
Illegal character '"'
Illegal character '"'
String: '" is"'

Reading the line with raw_input:

lexer.input(raw_input("Please type your line: "))

while True:
   tok = lexer.token()
   if not tok: break
   print tok

Output:

Please type your line: "I do not know what \"A\" is"
String: '"I do not know what \"A\" is"'

Reading the line with input:

lexer.input(input("Please type your line: "))

while True:
   tok = lexer.token()
   if not tok: break
   print tok

Output:

Please type your line: "I do not know what \"A\" is"
Illegal character '"'
Illegal character '"'
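
The cases above differ only in the text that actually reaches the lexer. The following is a minimal sketch of that difference, using only the standard library and no PLY. The variable names are illustrative, and ast.literal_eval is used here as a stand-in for what Python 2's input does to a typed string literal (an assumption made so the example does not need interactive input).

```python
import ast

# Raw literal: the backslashes are kept, so the lexer sees \" inside the string.
raw_data = r'"I do not know what \"A\" is"'

# Non-raw literal: Python turns each \" into a plain " before the lexer runs.
plain_data = '"I do not know what \"A\" is"'

# Simulates Python 2's input(): the typed line is evaluated as an expression,
# which strips the outer quotes and processes the escapes.
typed_line = r'"I do not know what \"A\" is"'
evaluated = ast.literal_eval(typed_line)

print(raw_data)    # "I do not know what \"A\" is"
print(plain_data)  # "I do not know what "A" is"
print(evaluated)   # I do not know what "A" is

# Number of backslashes left in each case: 2, 0 and 0.
print("%d %d %d" % (raw_data.count('\\'),
                    plain_data.count('\\'),
                    evaluated.count('\\')))
```

Only the first form still contains the escaped quotes that the NORMSTRING pattern is written to accept.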

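As a final note, you probably do not need the end-of-string anchor $ in your regular expression; with it, the pattern can only match when the string literal is the last thing in the input. If you remove it, also exclude the backslash from the first alternative (for example [^"\\\n]), otherwise the match can stop at the first escaped quote.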
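
The effect of the $ anchor can be checked with the plain re module, which PLY builds its token rules on. This is a sketch rather than a drop-in PLY rule: the input text (a string literal followed by more input) and the third pattern are assumptions chosen for illustration, not part of the original program.

```python
import re

# Hypothetical input: the string literal is followed by more text.
text = r'"I do not know what \"A\" is" x'

anchored = re.compile(r'"([^"\n]|(\\"))*"$')    # pattern from the question
unanchored = re.compile(r'"([^"\n]|(\\"))*"')   # same pattern without $
# Assumed alternative: ordinary characters exclude the backslash, and a
# backslash must be followed by exactly one escaped character.
stricter = re.compile(r'"([^"\\\n]|\\.)*"')

# $ requires the closing quote to be at the end of the input.
print(anchored.match(text))              # None

# Without $, the backslash is consumed as an ordinary character and the
# quote after it closes the string too early.
print(unanchored.match(text).group(0))   # "I do not know what \"

print(stricter.match(text).group(0))     # "I do not know what \"A\" is"
```

In the original pattern the $ anchor is what forces the regex engine to backtrack and treat \" as an escape, which is why removing it without changing the character class alters what gets matched.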

Licensed under: CC-BY-SA with attribution