Some background:
I am writing a parser to retrieve information from sites written in a markup language. Standard libraries such as wikitools, ... do not work for me because I need to be more specific, and adapting them to my needs puts a layer of complexity between me and the problem. With Python plus "simple" regexes I had difficulty identifying the dependencies between the different "tokens" of the markup language in a transparent manner, so I ended up at PLY.
Now it seems that PLY identifies tokens via regex differently from plain Python, but I can't find anything about it. I don't want to move on until I understand how PLY determines the tokens within its lexer; otherwise I would have no control over the logic I depend on and would fail at a later stage.
Here we go:
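import re  # needed for re.UNICODE below
import ply.lex as lex
text = r'--- 123456 ---'
token1 = r'-- .* --'
# PLY requires a tuple listing all token names
tokens = (
    'TEST',
)
# String rule: t_<NAME> holds the regex for that token
t_TEST = token1
lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
    print tok.type, tok.value, tok.lineno, tok.lexpos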
results in:
lex: tokens = ('TEST',)
lex: literals = ''
lex: states = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0
The last line is surprising. If PLY's matching were comparable to "search", I would have expected the first and the last - to be missing from --- 123456 ---, and if it were comparable to "match", I would have expected no token at all. This matters because otherwise -- cannot be distinguished from --- (or == from ===), i.e. headlines, enumerations, ... cannot be told apart.
So why does PLY behave differently from standard Python regex, and how? I couldn't find anything in the documentation or here on Stack Overflow.
I would guess the problem is my understanding of PLY, since the tool has been around for quite some time, i.e. this behavior is presumably intentional. The only somewhat related information I could find deals with different groups but does not explain a different way of matching the regexes themselves. I found nothing in ply-hack either.
Am I overlooking something stupidly simple?
For comparison, here is the same pattern with standard Python regex:
import re
text = r'--- 123456 ---'
token1 = r'-- .* --'
p = re.compile(token1)
m = p.search(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
m = p.match(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
gives:
Match found: -- 123456 --
No match
(as expected, first is the result of "search", second of "match")
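One further experiment that might narrow this down is to compile the same pattern under different flags and call match at position 0, since a lexer has to anchor each token at its current position. The sketch below uses only the standard re module. Trying re.VERBOSE is purely a guess on my side at a flag a lexer generator might add on top of reflags; I have not confirmed that this is what happens in my PLY setup. The helper name first_token is made up for illustration, and the print calls are written so that they run under both Python 2.7 and Python 3.

```python
import re

text = '--- 123456 ---'
pattern = r'-- .* --'

def first_token(flags):
    """Hypothetical helper: what does match() find at position 0?"""
    m = re.compile(pattern, flags).match(text)
    return m.group() if m else None

# Same flags as passed to lex.lex() above
plain = first_token(re.UNICODE)

# Assumption to test: in verbose mode, unescaped whitespace
# inside the pattern is ignored, so '-- .* --' acts like '--.*--'
verbose = first_token(re.UNICODE | re.VERBOSE)

print('plain  : %r' % (plain,))
print('verbose: %r' % (verbose,))
```

If the verbose variant reproduces the PLY token while the plain one does not, the unescaped spaces in the rule would be the first place to look, for example by writing them as `\ ` or `[ ]`.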
My settings: I am working with Spyder; this is the terminal display at start:
Python 2.7.5+ (default, Sep 19 2013, 13:49:51)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.
Thanks for your time and help.