Question

I still consider myself a newbie to pyparsing. I threw together two quick grammars, and neither does what I want. The grammar seemed really simple, but it turns out to be (at least for me) not so trivial. The language has one basic construct: keywords followed by body text. Bodies can span multiple lines. A keyword appears at the beginning of a line, within the first 20 characters or so, and is terminated with a ';' (no quotes). I put together a quick demo program so I could test a couple of grammars. However, each of them matches the first keyword and none after it.

I've attached the source code as an example, along with the output it produces. Even though this is just test code, I documented it out of habit. In the example below the two keywords are NOW; and lastly; and ideally I wouldn't want the semicolon included in the keyword.

Any ideas what I should do to make this work?

from pyparsing import *

def testString(text,grammar):
    """
    @summary: perform a test of a grammar
    @type text: text
    @param text: text buffer for input (a message to be parsed)
    @type grammar: MatchFirst or equivalent pyparsing construct
    @param grammar: some grammar defined somewhere else
    @type pgm: text
    @param pgm: typically name of the program, which invoked this function.
    @status: 20130802 CODED
    """
    print 'Input Text is %s' % text
    print 'Grammar is %s' % grammar
    tokens = grammar.parseString(text)
    print 'After parse string: %s' % tokens
    tokens.dump()
    tokens.keys()

    return tokens


def getText(msgIndex):
    """
    @summary: make a text string suitable for parsing
    @returns: returns a text buffer
    @type msgIndex: int
    @param msgIndex: a number corresponding to a text buffer to retrieve
    @status: 20130802 CODED
    """

    msg = [  """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
    """,
         '',
      ]

    return msg[msgIndex]

def getGrammar(grammarIndex):
    """
    @summary: make a grammar given an index
    @type grammarIndex: int
    @param grammarIndex: a number corresponding to the grammar to be retrieved
    @note: a good run will return 2 keys: NOW; and lastly; and each key will have an associated body. The body is all
    words and text up to the next keyword or EOF, whichever comes first.
    """
    kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
    kw.setDebug(True)
    body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
    body1.setDebug(True)
    g1 = OneOrMore(Group(kw + body1))

    # ok start defining a new grammar (borrow kw from grammar).

    body2 = SkipTo(~kw, include=False)('BODY')
    body2.setDebug(True)

    g2 = OneOrMore(Group(kw+body2))
    grammar = [g1,
           g2,
          ]
    return grammar[grammarIndex]


if __name__ == '__main__':
    # list indices [ text, grammar ]
    tests = {1: [0,0],
         2: [0,1],
        }
    check = tests.keys()
    check.sort()
    for testno in check:
        print 'STARTING Test %d' % testno
        text = getText(tests[testno][0])
        grammar = getGrammar(tests[testno][1])
        tokens = testString(text, grammar)
        print 'Tokens found %s' % tokens
        print 'ENDING Test %d' % testno

The output looks like this (using Python 2.7 and pyparsing 2.0.1):

    STARTING Test 1
    Input Text is NOW; is the time for a few good ones to come to the aid
    of new things to come for it is almost time for
    a tornado to strike upon a small hill
    when least expected.
    lastly; another day progresses and
    then we find that which we seek
    and finally we will
    find our happiness perhaps its closer than 1 or 2 years or not so

    Grammar is {Group:({Combine:({W:(abcd...) ";"}) {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]...})}...
    Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
    Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
    Match {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... at loc 4(1,5)
    Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
    Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
    Matched {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}} [, {{W:(abcd...)}... ~{Combine:({W:(abcd...) ";"})}}]... -> ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
    Match Combine:({W:(abcd...) ";"}) at loc 161(4,20)
    Exception raised:Expected W:(abcd...) (at char 161), (line:4, col:20)
    After parse string: [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
    Tokens found [['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']]
    ENDING Test 1
    STARTING Test 2
    Input Text is NOW; is the time for a few good ones to come to the aid
    of new things to come for it is almost time for
    a tornado to strike upon a small hill
    when least expected.
    lastly; another day progresses and
    then we find that which we seek
    and finally we will
    find our happiness perhaps its closer than 1 or 2 years or not so

    Grammar is {Group:({Combine:({W:(abcd...) ";"}) SkipTo:(~{Combine:({W:(abcd...) ";"})})})}...
    Match Combine:({W:(abcd...) ";"}) at loc 0(1,1)
    Matched Combine:({W:(abcd...) ";"}) -> ['NOW;']
    Match SkipTo:(~{Combine:({W:(abcd...) ";"})}) at loc 4(1,5)
    Match Combine:({W:(abcd...) ";"}) at loc 4(1,5)
    Exception raised:Expected ";" (at char 7), (line:1, col:8)
    Matched SkipTo:(~{Combine:({W:(abcd...) ";"})}) -> ['']
    Match Combine:({W:(abcd...) ";"}) at loc 5(1,6)
    Exception raised:Expected ";" (at char 7), (line:1, col:8)
    After parse string: [['NOW;', '']]
    Tokens found [['NOW;', '']]
    ENDING Test 2

    Process finished with exit code 0

Solution

I'm good with TDD, but here your whole testing and alternative-selecting infrastructure really gets in the way of seeing just where the grammar is and what's going on with it. If I strip away all the extra machinery, I see your grammar is just:

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')
g1 = OneOrMore(Group(kw + body1))

The first issue I see is your definition of body1:

body1 = delimitedList(OneOrMore(Word(alphas + nums)) +~kw)('Body')

You are on the right track with a negative lookahead, but for it to work in pyparsing, you have to put it at the beginning of the expression, not at the end. Think of it as "before I match another valid word, I will first rule out that it is a keyword":

body1 = delimitedList(OneOrMore(~kw + Word(alphas + nums)))('Body')

(Why is this a delimitedList, by the way? delimitedList is usually reserved for true lists with delimiters, such as comma-delimited arguments to a program function. All this does is accept any commas that might be mixed into the body, which should be handled more straightforwardly using a list of punctuation.)
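To see why the position of the lookahead matters, here is a minimal sketch that runs both placements on a shortened, punctuation-free input. The sample text and the names word, body_late, and body_early are assumptions made for illustration, and delimitedList is left out so that only the lookahead position differs. print is called with a single argument so the snippet is not tied to the Python 2 print statement.

```python
from pyparsing import Combine, Group, Literal, OneOrMore, Word, alphanums

# Shortened, punctuation-free sample (an assumption for illustration).
text = "NOW; is the time\nlastly; another day"

kw = Combine(Word(alphanums) + Literal(';'))
word = Word(alphanums)

# Lookahead at the end: the word loop runs first and swallows 'lastly',
# so the keyword check comes too late to protect it.
body_late = OneOrMore(word) + ~kw
late = OneOrMore(Group(kw + body_late)).parseString(text).asList()

# Lookahead at the beginning: every candidate word is first checked
# against kw, so the loop stops just before 'lastly;'.
body_early = OneOrMore(~kw + word)
early = OneOrMore(Group(kw + body_early)).parseString(text).asList()

print(late)
print(early)
```

With the lookahead at the end, 'lastly' lands inside the first body and parsing stops at the stray ';'. With it at the beginning, each keyword starts its own group.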

Here is my test version of your code:

from pyparsing import *

kw = Combine(Word(alphas + nums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphas + nums))('Body')
g1 = OneOrMore(Group(kw + body1))

msg = [  """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
    """,
             '',
          ][0]

result = g1.parseString(msg)
# we expect multiple groups, each containing "KEY" and "Body" names,
# so iterate over groups, and dump the contents of each
for res in result:
    print res.dump()

I still get the same results as you: only the first keyword matches. To see where the disconnect is happening, I use scanString, which returns not only the matched tokens but also the start and end locations of the match:

result,start,end = next(g1.scanString(msg))
print len(msg),end

Which gives me:

320 161

So I see that we are ending at location 161 in a string whose total length is 320, so I'll add one more print statement:

print msg[end:end+10]

and I get:

.
lastly;

The trailing period in your body text is the culprit. If I remove that from the message and try parseString again, I now get:

['NOW;', 'is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- Body: ['is', 'the', 'time', 'for', 'a', 'few', 'good', 'ones', 'to', 'come', 'to', 'the', 'aid', 'of', 'new', 'things', 'to', 'come', 'for', 'it', 'is', 'almost', 'time', 'for', 'a', 'tornado', 'to', 'strike', 'upon', 'a', 'small', 'hill', 'when', 'least', 'expected']
- KEY: NOW;
['lastly;', 'another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- Body: ['another', 'day', 'progresses', 'and', 'then', 'we', 'find', 'that', 'which', 'we', 'seek', 'and', 'finally', 'we', 'will', 'find', 'our', 'happiness', 'perhaps', 'its', 'closer', 'than', '1', 'or', '2', 'years', 'or', 'not', 'so']
- KEY: lastly;
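Another way to surface this kind of silent partial match is to ask pyparsing to insist on consuming the whole input. The sketch below assumes a shortened sample string and the corrected body1. Passing parseAll=True to parseString raises a ParseException at the point where matching stopped, instead of quietly returning a truncated result.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, ParseException,
                       Word, alphanums)

kw = Combine(Word(alphanums) + Literal(';'))('KEY')
body1 = OneOrMore(~kw + Word(alphanums))('Body')
g1 = OneOrMore(Group(kw + body1))

# Shortened sample (an assumption) that keeps the troublesome period.
text = "NOW; when least expected.\nlastly; another day"

stopped_at = None
try:
    g1.parseString(text, parseAll=True)
except ParseException as err:
    # err.loc is the offset at which the grammar could not continue.
    stopped_at = err.loc
    print(err)
    print(repr(text[stopped_at:stopped_at + 10]))
```

This reports the same location that the scanString check found by hand, which is handy when the unmatched text is buried in a long input.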

If you want to handle punctuation, I suggest you add something like:

PUNC = oneOf(". , ? ! : & $")

and add it to body1:

body1 = OneOrMore(~kw + (Word(alphas + nums) | PUNC))('Body')
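Putting the pieces together, the following sketch combines the punctuation-aware body with one possible way to drop the semicolon from the keyword, as the question asked. Replacing Literal(';') with Suppress(';') is a suggestion rather than part of the grammar above, and the sections list is a hypothetical name used only to collect the results.

```python
from pyparsing import (Combine, Group, OneOrMore, Suppress, Word, alphanums,
                       oneOf)

msg = """NOW; is the time for a few good ones to come to the aid
of new things to come for it is almost time for
a tornado to strike upon a small hill
when least expected.
lastly; another day progresses and
then we find that which we seek
and finally we will
find our happiness perhaps its closer than 1 or 2 years or not so
    """

# Suppress(';') still requires the semicolon but leaves it out of the token;
# Combine keeps the rule that it must directly follow the keyword.
kw = Combine(Word(alphanums) + Suppress(';'))('KEY')
PUNC = oneOf(". , ? ! : & $")
body1 = OneOrMore(~kw + (Word(alphanums) | PUNC))('Body')
g1 = OneOrMore(Group(kw + body1))

# Collect (keyword, list of body tokens) pairs in input order.
sections = [(grp['KEY'], grp['Body'].asList()) for grp in g1.parseString(msg)]
for key, body in sections:
    print(key)
    print(' '.join(body))
```

The original message now parses without removing the period, and the keys come back as NOW and lastly with no trailing semicolon.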
Licensed under: CC-BY-SA with attribution