Question

Some background:

I am writing a parser to retrieve information from sites that use a markup language. Standard libraries such as wikitools and the like do not work for me, as I need to be more specific, and adapting them to my needs puts a layer of complexity between me and the problem. With Python and "simple" regexes I had difficulty identifying the dependencies between the different "tokens" of the markup language in a transparent manner, so I ended up at PLY at the end of this journey.

Now it seems that PLY identifies tokens via regex differently from plain Python, but I can't find anything about it. I don't want to move on until I understand how PLY determines the tokens within its lexer (otherwise I would have no control over the logic I depend on and would fail at a later stage).

Here we go:

import re  # needed for re.UNICODE below
import ply.lex as lex

text = r'--- 123456 ---'
token1 = r'-- .* --'
tokens = (
   'TEST',
)
t_TEST = token1

lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
    print tok.type, tok.value, tok.lineno, tok.lexpos

results in:

lex: tokens   = ('TEST',)
lex: literals = ''
lex: states   = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0

The last line is surprising: I would have expected the first and the last - to be missing from --- 123456 --- if the behavior were comparable to "search" (and no result at all if it were comparable to "match"). This matters because -- then cannot be distinguished from --- (or == from ===), i.e. headlines, enumerations, ... cannot be differentiated.

So why does PLY behave differently from standard Python regex? (And how? I couldn't find anything in the documentation or here on Stack Overflow.)

I would guess the problem is my understanding of PLY, as the tool has been around for quite some time already, i.e. this behavior is presumably intentional. The only somewhat related information I could find deals with different groups but does not explain why regexes themselves would be matched differently. I found nothing in ply-hack either.

Am I overlooking something stupidly simple?

For comparison, here is the same pattern with standard Python regex:

import re

text = r'--- 123456 ---'
token1 = r'-- .* --'

p = re.compile(token1)

m = p.search(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

m = p.match(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

gives:

Match found:  -- 123456 --
No match

(as expected, first is the result of "search", second of "match")

My setup: I am working with Spyder; this is the terminal display at startup:

Python 2.7.5+ (default, Sep 19 2013, 13:49:51) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.

Thanks for your time and help.


Solution

The answer to "ply lexmatch regular expression has different groups than a usual re" helps here too. In lex.py:

c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)

Notice the VERBOSE flag. It means the re engine ignores unescaped whitespace in your regexps. So r'-- .* --' really means r'--.*--', which indeed matches the whole of a string like '--- foobar ---'. See the documentation of re.VERBOSE for more details.
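
The effect can be reproduced with the re module alone, without PLY. The following is a minimal sketch using the pattern and text from the question; the variable names are only illustrative, and the escaped-space variant is one possible fix rather than the only one.

```python
import re

text = '--- 123456 ---'

# Without VERBOSE, the spaces in the pattern are literal characters.
plain = re.compile(r'-- .* --')

# With VERBOSE (the flag PLY adds when compiling token rules), unescaped
# whitespace in the pattern is ignored, so this behaves like r'--.*--'.
verbose = re.compile(r'-- .* --', re.VERBOSE)

# Escaping each space keeps it significant even under VERBOSE.
escaped = re.compile(r'--\ .*\ --', re.VERBOSE)

plain_match = plain.match(text)      # no match: the third character is '-', not ' '
verbose_match = verbose.match(text)  # matches the entire string
escaped_match = escaped.match(text)  # no match, same as the plain pattern

for name, m in (('plain', plain_match),
                ('verbose', verbose_match),
                ('escaped', escaped_match)):
    print('%s: %s' % (name, m.group() if m else 'No match'))
```

Assuming PLY passes the rule string through unchanged, as the lex.py line above suggests, writing the token as t_TEST = r'--\ .*\ --' (or with [ ] or \s in place of each space) should make the lexer treat the spaces as part of the token.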

Licensed under: CC-BY-SA with attribution