Question
I'm using the Python module PLY to write a parser, implementing it as I go. I have a simple rule to detect strings:
r'("|\').*("|\')'
When lexer errors are thrown I have this:
def t_error(t):
    print 'Illegal lexer input line ' + str(t.lineno) + ' ' + t.value[:16]
    sys.exit(-1)
When I feed my parser the following input:
parse("preg_match('%^[\*\%]+$%', $keywords)")
I get back this in return:
Illegal lexer input line 1 %^[\*\%]+$%', $k
My questions are:
1) Why am I not parsing this string? It seems like my regex should properly handle this string.
2) How can I fix this?
edit:
I have narrowed the problem down a bit. The following strings throw illegal lexer input errors by themselves:
'%'
'^'
Solution
Even if this regex were working, it wouldn't do quite what you want: for example, it would accept "this', which isn't really a string. This is also the cause of the "illegal lexer input" error. After having done its job and found the first string in "preg_match(', the lexer is then upset because each of the next 11 characters, %^[\*\%]+$%, is illegal (and not in t_ignore): none of them is a " or ' that could start a string.
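The mismatched-quote point can be checked outside PLY. The snippet below is a minimal sketch that uses only Python's standard `re` module, so it says nothing about how PLY itself tokenizes the input; the names `ORIGINAL`, `candidate`, and `accepted` are made up for illustration.

```python
import re

# The string pattern from the question, unchanged.
ORIGINAL = re.compile(r'("|\').*("|\')')

# Opens with a double quote but closes with a single quote.
candidate = "\"this'"

accepted = ORIGINAL.fullmatch(candidate) is not None
print(accepted)  # True: the mismatched delimiters are accepted
```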
Try doing this with two cases, one for " and one for ': "starts with a quote, then some things which aren't that quote, ends with that quote." That is:
r'("[^"]*")|(\'[^\']*\')'
Or, if you want to allow escaped quotes inside the string:
r'("(\\"|[^"])*")|(\'(\\\'|[^\'])*\')'
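As a rough check, both patterns can be tried with Python's standard `re` module instead of a full PLY lexer. This is only a sketch: the names `SIMPLE`, `ESCAPED`, and `first_string` are hypothetical helpers, and PLY may behave differently once other token rules are involved.

```python
import re

# Both patterns are copied from the answer above.
SIMPLE = re.compile(r'("[^"]*")|(\'[^\']*\')')
ESCAPED = re.compile(r'("(\\"|[^"])*")|(\'(\\\'|[^\'])*\')')

def first_string(pattern, text):
    """Return the first string literal the pattern finds in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

# The input from the question, without the outer Python quotes.
source = r"preg_match('%^[\*\%]+$%', $keywords)"
print(first_string(SIMPLE, source))

# Mismatched delimiters are no longer accepted.
print(SIMPLE.fullmatch("\"this'"))

# With escaped quotes, the simple pattern stops at the first inner quote,
# while the escaped variant consumes the whole literal.
quoted = r'"say \"hi\" now"'
print(first_string(SIMPLE, quoted))
print(first_string(ESCAPED, quoted))
```

The simple form is enough when the language being lexed has no escape sequences; the second form is needed once a backslash-escaped quote may appear inside a string.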