Question

I am quite new to pyparsing and have a missing match that I don't understand.

Here is the text I would like to parse:

polraw="""
set policy id 800 from "Untrust" to "Trust"  "IP_10.124.10.6" "MIP(10.0.2.175)" "TCP_1002" permit
set policy id 800
set dst-address "MIP(10.0.2.188)"
set service "TCP_1002-1005"
set log session-init
exit
set policy id 724 from "Trust" to "Untrust"  "IP_10.16.14.28" "IP_10.24.10.6" "TCP_1002" permit
set policy id 724
set src-address "IP_10.162.14.38"
set dst-address "IP_10.3.28.38"
set service "TCP_1002-1005"
set log session-init
exit
set policy id 233 name "THE NAME is 527 ;" from "Untrust" to "Trust"  "IP_10.24.108.6" "MIP(10.0.2.149)" "TCP_1002" permit
set policy id 233
set service "TCP_1002-1005"
set service "TCP_1006-1008"
set service "TCP_1786"
set log session-init
exit

"""

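I set up the grammar this way:

from pyparsing import (Keyword, LineEnd, Optional, Regex, Suppress,
                       ZeroOrMore, dblQuotedString)

KPOL  = Suppress(Keyword('set policy id'))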
NUM   = Regex(r'\d+')
KSVC  = Suppress(Keyword('set service'))
KSRC  = Suppress(Keyword('set src-address'))
KDST  = Suppress(Keyword('set dst-address'))
SVC    = dblQuotedString.setParseAction(lambda t: t[0].replace('"',''))
ADDR   = dblQuotedString.setParseAction(lambda t: t[0].replace('"',''))
EXIT  = Suppress(Keyword('exit'))
EOL = LineEnd().suppress()

P_SVC = KSVC + SVC + EOL
P_SRC = KSRC + ADDR + EOL
P_DST = KDST + ADDR + EOL

x = KPOL + NUM('PId') + EOL + Optional(ZeroOrMore(P_SVC)) + Optional(ZeroOrMore(P_SRC)) + Optional(ZeroOrMore(P_DST)) 

for z in x.searchString(polraw):
    print(z)

The result set is:

['800', 'MIP(10.0.2.188)']
['724', 'IP_10.162.14.38', 'IP_10.3.28.38']
['233', 'TCP_1002-1005', 'TCP_1006-1008', 'TCP_1786']

Policy 800 is missing its service entry. What is wrong here?

Thanks in advance, Laurent


Solution

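The problem is that your expression fixes the order of the statements: it looks for SVC's first, then SRC's, then DST's. In policy 800 the dst-address line comes before the service line, so by the time the DST has been matched the parser is already past the point where it would accept an SVC, and the service line is never picked up. You have a couple of options; I'll go through each so you can get a sense of what is going on here.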
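
The ordering issue can be reproduced on policy 800 alone. This is only a minimal sketch, assuming Python 3 and a pyparsing release that still provides the camelCase names used in the question. The names `sample`, `quoted`, `ordered` and `ordered_tokens` are illustrative and not from the original post, and `.copy()` is added so that the shared `dblQuotedString` is left unmodified.

```python
from pyparsing import (Keyword, LineEnd, Optional, Regex, Suppress,
                       ZeroOrMore, dblQuotedString, removeQuotes)

# Policy 800 from the question: dst-address comes BEFORE service.
sample = '''set policy id 800
set dst-address "MIP(10.0.2.188)"
set service "TCP_1002-1005"
set log session-init
exit
'''

# Illustrative helper: a private copy of dblQuotedString that strips the quotes.
quoted = dblQuotedString.copy().setParseAction(removeQuotes)
EOL = LineEnd().suppress()
KPOL = Suppress(Keyword('set policy id'))
NUM = Regex(r'\d+')
P_SVC = Suppress(Keyword('set service')) + quoted + EOL
P_SRC = Suppress(Keyword('set src-address')) + quoted + EOL
P_DST = Suppress(Keyword('set dst-address')) + quoted + EOL

# Same shape as the question's grammar: services, then sources, then destinations.
ordered = (KPOL + NUM('PId') + EOL
           + Optional(ZeroOrMore(P_SVC))
           + Optional(ZeroOrMore(P_SRC))
           + Optional(ZeroOrMore(P_DST)))

ordered_tokens = ordered.parseString(sample).asList()
print(ordered_tokens)  # the service line after dst-address is not collected
```

The destination is collected, but the service line that follows it is not, because the slot for services has already been passed.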

(But first, there is no point in writing "Optional(ZeroOrMore(anything))" - ZeroOrMore already implies Optional, so I'm going to drop the Optional part in any of these choices.)

If you are going to get SVC's, SRC's, and DST's in any order, you could refactor your ZeroOrMore to accept any of the three data types, like this:

x = KPOL + NUM('PId') + EOL + ZeroOrMore(P_SVC|P_SRC|P_DST)

This will allow you to intermix different types of statements, and they will all get collected as part of the ZeroOrMore repetition.
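
As a rough check of this first option, here is a small self-contained sketch, assuming Python 3 and the camelCase pyparsing API. The trimmed `sample` text, the `quoted` helper and the `any_order` and `matches` names are illustrative choices and not part of the original post.

```python
from pyparsing import (Keyword, LineEnd, Regex, Suppress, ZeroOrMore,
                       dblQuotedString, removeQuotes)

# Two policy bodies from the question, with the statements in different orders.
sample = '''set policy id 800
set dst-address "MIP(10.0.2.188)"
set service "TCP_1002-1005"
set log session-init
exit
set policy id 724
set src-address "IP_10.162.14.38"
set dst-address "IP_10.3.28.38"
set service "TCP_1002-1005"
set log session-init
exit
'''

quoted = dblQuotedString.copy().setParseAction(removeQuotes)
EOL = LineEnd().suppress()
KPOL = Suppress(Keyword('set policy id'))
NUM = Regex(r'\d+')
P_SVC = Suppress(Keyword('set service')) + quoted + EOL
P_SRC = Suppress(Keyword('set src-address')) + quoted + EOL
P_DST = Suppress(Keyword('set dst-address')) + quoted + EOL

# One repetition that accepts any of the three statement types on each pass.
any_order = KPOL + NUM('PId') + EOL + ZeroOrMore(P_SVC | P_SRC | P_DST)

matches = [tokens.asList() for tokens in any_order.searchString(sample)]
for m in matches:
    print(m)
```

Policy 800 now keeps its service entry, since the repetition no longer cares which statement type comes first.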

If you want to keep these different types of statements in groups, then you can add a results name to each:

x = KPOL + NUM('PId') + EOL + ZeroOrMore(P_SVC("svc*")|
                                         P_SRC("src*")|
                                         P_DST("dst*"))

Note the trailing '*' on each name - this is equivalent to calling setResultsName with the listAllMatches argument equal to True. As each different expression is matched, the results for the different types will get collected into the "svc", "src", or "dst" results name. Calling z.dump() will list the tokens and the results names and their values, so you can see how this works. For example, this input:

set policy id 233
set service "TCP_1002-1005"
set dst-address "IP_10.3.28.38"
set service "TCP_1006-1008"
set service "TCP_1786"
set log session-init
exit

shows this for z.dump():

['233', 'TCP_1002-1005', 'IP_10.3.28.38', 'TCP_1006-1008', 'TCP_1786']
- PId: 233
- dst: [['IP_10.3.28.38']]
- svc: [['TCP_1002-1005'], ['TCP_1006-1008'], ['TCP_1786']]

If you wrap the P_xxx expressions in ungroup before building x, maybe like this:

P_SVC,P_SRC,P_DST = (ungroup(expr) for expr in (P_SVC,P_SRC,P_DST))

then the output is even cleaner-looking:

['233', 'TCP_1002-1005', 'IP_10.3.28.38', 'TCP_1006-1008', 'TCP_1786']
- PId: 233
- dst: ['IP_10.3.28.38']
- svc: ['TCP_1002-1005', 'TCP_1006-1008', 'TCP_1786']
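
A related variation, offered only as a sketch: instead of naming the whole P_xxx statement and then ungrouping it, the '*' names can be attached directly to the quoted-string tokens, which should give flat lists without the extra step. This departs slightly from the code above, and it assumes Python 3 and the camelCase pyparsing API; `quoted`, `named`, `services` and `destinations` are illustrative names.

```python
from pyparsing import (Keyword, LineEnd, Regex, Suppress, ZeroOrMore,
                       dblQuotedString, removeQuotes)

sample = '''set policy id 233
set service "TCP_1002-1005"
set dst-address "IP_10.3.28.38"
set service "TCP_1006-1008"
set service "TCP_1786"
set log session-init
exit
'''

SET, POLICY, ID, SERVICE, SRC_ADDRESS, DST_ADDRESS = map(
    Keyword, "set policy id service src-address dst-address".split())
quoted = dblQuotedString.copy().setParseAction(removeQuotes)
EOL = LineEnd().suppress()
NUM = Regex(r'\d+')

# The trailing '*' turns on listAllMatches, so repeated matches accumulate.
# Calling quoted("name*") returns a named copy, so each statement gets its own.
P_SVC = Suppress(SET + SERVICE) + quoted("svc*") + EOL
P_SRC = Suppress(SET + SRC_ADDRESS) + quoted("src*") + EOL
P_DST = Suppress(SET + DST_ADDRESS) + quoted("dst*") + EOL

named = (Suppress(SET + POLICY + ID) + NUM('PId') + EOL
         + ZeroOrMore(P_SVC | P_SRC | P_DST))

result = named.parseString(sample)
services = result["svc"].asList()
destinations = result["dst"].asList()
print(result["PId"], services, destinations)
```

Since this sample has no src-address line, the "src" name is simply absent from the result.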

This is actually looking pretty good, but let me pass on one other option. There are a number of cases where parsers have to look for several sub-expressions in any order. Let's say they are A, B, C, and D. To accept these in any order, you could write something like OneOrMore(A|B|C|D), but this would also accept input with multiple A's, or input with A, B, and C but no D. The exhaustive/exhausting combinatorial explosion of (A+B+C+D) | (A+B+D+C) | etc. could be written, or you could maybe automate it with something like

from itertools import permutations
mixNmatch = MatchFirst(And(p) for p in permutations((A,B,C,D),4))
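
To get a feel for how quickly that list of alternatives grows, here is a small standard-library sketch (no pyparsing needed). The helper name `count_orderings` is made up for this illustration.

```python
from itertools import permutations
from math import factorial


def count_orderings(n):
    """Count the alternatives an explicit any-order grammar needs for n parts."""
    return sum(1 for _ in permutations(range(n), n))


# Number of And(...) alternatives for 3, 4 and 6 sub-expressions.
ordering_counts = {n: count_orderings(n) for n in (3, 4, 6)}
print(ordering_counts)
```

The count is n factorial, so four sub-expressions already need 24 alternatives and six need 720.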

But there is a class in pyparsing called Each that allows you to write the same kind of thing:

Each([A,B,C,D])

meaning "must have one each of A, B, C, and D, in any order". And like And, Or, NotAny, etc., there is an operator shortcut too:

A & B & C & D

which means the same thing.

If you want "must have A, B, and C, and optionally D", then write:

A & B & C & Optional(D)

and this will parse with the same kind of behavior, looking for A, B, C, and D, regardless of the incoming order, and whether D is last or mixed in with A, B, and C. You can also use OneOrMore and ZeroOrMore to indicate optional repetition of any of the expressions.
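
A small sketch of the basic A, B, C, D case, assuming Python 3 and the camelCase pyparsing API. The keywords "alpha", "beta", "gamma" and "delta" are placeholders standing in for A to D, and the variable names are illustrative.

```python
from pyparsing import Keyword, Optional, ParseException

# Placeholder keywords standing in for A, B, C and D.
A, B, C, D = map(Keyword, "alpha beta gamma delta".split())

# A, B and C are required in any order; D may appear anywhere or not at all.
any_order = A & B & C & Optional(D)

with_d = any_order.parseString("gamma alpha delta beta").asList()
without_d = any_order.parseString("beta gamma alpha").asList()

# Leaving out a required element should be rejected.
try:
    any_order.parseString("alpha beta")
    missing_c_accepted = True
except ParseException:
    missing_c_accepted = False

print(with_d, without_d, missing_c_accepted)
```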

So you could write your expression as:

x = KPOL + NUM('PId') + EOL + (ZeroOrMore(P_SVC) & 
                               ZeroOrMore(P_SRC) & 
                               ZeroOrMore(P_DST))

I looked at using results names with this expression, and the ZeroOrMore's seem to be confusing things, maybe still a bug in how this is done. So you may have to reserve using Each for more basic cases like my A,B,C,D example. But I wanted to make you aware of it.

Some other notes on your parser:

dblQuotedString.setParseAction(lambda t: t[0].replace('"','')) is probably better written dblQuotedString.setParseAction(removeQuotes). You don't have any embedded quotes in your examples, but it's good to be aware of where your assumptions might not translate to a future application. Here are a couple of ways of removing the defining quotes:

from pyparsing import dblQuotedString, removeQuotes

dblQuotedString.setParseAction(lambda t: t[0].replace('"',''))
print(dblQuotedString.parseString(r'"This is an embedded quote \" and an ending quote \""')[0])
# prints 'This is an embedded quote \ and an ending quote \'
# removes the leading and trailing "s, but also the internal ones, which are
# really part of the quoted string

dblQuotedString.setParseAction(lambda t: t[0].strip('"'))
print(dblQuotedString.parseString(r'"This is an embedded quote \" and an ending quote \""')[0])
# prints 'This is an embedded quote \" and an ending quote \'
# removes the leading and trailing "s and leaves the internal one, but also
# strips off the escaped ending quote

dblQuotedString.setParseAction(removeQuotes)
print(dblQuotedString.parseString(r'"This is an embedded quote \" and an ending quote \""')[0])
# prints 'This is an embedded quote \" and an ending quote \"'
# just removes the leading and trailing " characters, leaves escaped "s in place
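
The same three behaviours can be checked on the raw string with plain Python, without pyparsing. This sketch assumes that removeQuotes simply slices off the first and last character of the matched token, which is my understanding of its implementation; the variable names are illustrative.

```python
# The full text matched by dblQuotedString, including its delimiting quotes.
raw = r'"This is an embedded quote \" and an ending quote \""'

replaced = raw.replace('"', '')   # drops every double quote, escaped or not
stripped = raw.strip('"')         # drops ALL leading and trailing double quotes
sliced = raw[1:-1]                # drops exactly one character from each end

for label, value in (("replace", replaced), ("strip", stripped), ("slice", sliced)):
    print(label, "->", value)
```

Only the slice keeps both escaped quotes, which is why removing exactly one character from each end is the safer choice.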

KPOL = Suppress(Keyword('set policy id')) is a bit fragile, as it will break if there are any extra spaces between 'set' and 'policy', or between 'policy' and 'id'. I usually define these kinds of expressions by first defining all the keywords individually:

SET,POLICY,ID,SERVICE,SRC_ADDRESS,DST_ADDRESS,EXIT = map(Keyword,
    "set policy id service src-address dst-address exit".split())

and then define the separate expressions using:

KPOL  = Suppress(SET + POLICY + ID)
KSVC  = Suppress(SET + SERVICE)
KSRC  = Suppress(SET + SRC_ADDRESS)
KDST  = Suppress(SET + DST_ADDRESS)

Now your parser will cleanly handle extra whitespace between the individual keywords in your expressions (or even comments, if you tell the parser to skip them with ignore()).
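
The difference can be seen on a line with extra spaces between the keywords. This is a minimal sketch, assuming Python 3 and the camelCase pyparsing API; `fragile`, `robust` and `line` are illustrative names.

```python
from pyparsing import (Keyword, ParseException, Suppress,
                       dblQuotedString, removeQuotes)

SET, SERVICE = map(Keyword, "set service".split())
quoted = dblQuotedString.copy().setParseAction(removeQuotes)

# One keyword containing a single literal space versus two separate keywords.
fragile = Suppress(Keyword('set service')) + quoted
robust = Suppress(SET + SERVICE) + quoted

line = 'set    service "TCP_1786"'  # extra spaces between 'set' and 'service'

robust_tokens = robust.parseString(line).asList()

try:
    fragile.parseString(line)
    fragile_matched = True
except ParseException:
    fragile_matched = False

print(robust_tokens, fragile_matched)
```

The two-keyword form matches because whitespace is skipped between the separate expressions, while the single keyword only matches its literal text with exactly one space.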

Licensed under: CC-BY-SA with attribution