Exclude tokens that are in a predefined keyword list

https://stackoverflow.com/questions/16824927

30-05-2022

Question

I need to parse a document for tokens. A token is a string of alphanumeric characters whose first character is alphabetic.

In the example below, I want to see the following tokens: zoo, abcd, moo, join6.

My code looks like the following:

#!/usr/bin/env python
from pyparsing import *

reserved_words = ( Keyword('JOIN') | Keyword('ON'))
identifier = Combine(Word(alphas, exact=1) + Optional(Word(alphanums + '_')))
token = ~reserved_words + identifier

txt = """
JOIN zoo ON abcd JOIN
moo ON join6;"""

for token, start, end in token.scanString(txt):
    print(token, start, end)

The output that I see is:

['OIN'] 2 5
['zoo'] 5 9
['N'] 11 12
['abcd'] 12 17   
['OIN'] 19 22
['moo'] 22 26
['N'] 28 29
['join6'] 29 35

I will appreciate any help.

Additional example:

I have to parse an SQL-like language that has keywords such as JOIN, ON, and AS. I changed the definition of "table" the way you suggested. Both the keyword AS and the alias identifier after it are optional; the second line of "txt" uses neither. However, only the first line is matched, as the output below the code shows, and I don't understand why this happens.

#!/usr/bin/env python
from pyparsing import *

join_kw , on_kw, as_kw = map(lambda x: Keyword(x, caseless=True), ['JOIN' , 'ON', 'AS'])
reserved_words = ( join_kw | on_kw | as_kw)
identifier = Combine(Word(alphas, exact=1) + Optional(Word(alphanums + '_')))
table = (reserved_words.suppress() | identifier) 
stmt = join_kw + table + Optional(as_kw) + Optional(identifier) + on_kw
txt = """
JOIN zoo AS t ON abcd 
JOIN moo ON join6;"""

for token, start, end in stmt.scanString(txt):
    if len(token) != 0:
        print(token, start, end)

['JOIN', 'zoo', 'AS', 't', 'ON'] 1 17

Solution

scanString goes through the input string looking for matches, and when there is no match at the current position it advances by one character and tries again. At position 1, it tries to match token and fails, because JOIN is a reserved word and so fails the NotAny lookahead (~reserved_words). scanString then advances to position 2, where OIN is a perfectly valid token, so it is reported as a match.
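
The two positions described above can be checked in isolation. This is a minimal sketch that reuses the definitions from the question; the helper try_token is a hypothetical name introduced only for illustration, and parseString is used here simply to test one starting position at a time.

```python
from pyparsing import (Combine, Keyword, Optional, ParseException, Word,
                       alphas, alphanums)

reserved_words = Keyword('JOIN') | Keyword('ON')
identifier = Combine(Word(alphas, exact=1) + Optional(Word(alphanums + '_')))
token = ~reserved_words + identifier


def try_token(text):
    """Hypothetical helper: return the parsed tokens, or None if token fails."""
    try:
        return token.parseString(text).asList()
    except ParseException:
        return None


# Starting at the keyword itself, the NotAny lookahead rejects the match.
at_keyword = try_token("JOIN zoo")

# One character later, "OIN" is no longer a keyword, so it is a valid token.
one_char_later = try_token("OIN zoo")

print(at_keyword, one_char_later)
```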

If you just want the tokens and want scanString to skip over the keywords, then use:

for token, start, end in (reserved_words.suppress() | token).scanString(txt):
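
For reference, a runnable sketch of this first option follows. It assumes the sample text from the question; the names scanner and found are introduced here for illustration. Suppressed keywords still count as matches, so scanString yields them with an empty result, and the loop skips those.

```python
from pyparsing import Keyword, Word, alphas, alphanums

reserved_words = Keyword('JOIN') | Keyword('ON')
identifier = Word(alphas, alphanums + '_')
token = ~reserved_words + identifier

txt = """
JOIN zoo ON abcd JOIN
moo ON join6;"""

# Keywords are matched first and suppressed, so the scanner consumes them
# whole instead of restarting one character into the keyword.
scanner = reserved_words.suppress() | token

found = []
for tokens, start, end in scanner.scanString(txt):
    if len(tokens) != 0:  # a suppressed keyword yields an empty result
        found.append(tokens[0])

print(found)
```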

Or use parseString instead of scanString:

for item in ZeroOrMore(reserved_words|token).parseString(txt):
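
A sketch of the parseString option, again using the question's sample text. As written in the answer, the expression returns the keywords along with the identifiers; the second variant, which suppresses the keywords, is an assumption about what is wanted rather than part of the original answer. Without parseAll=True, parseString simply stops at the trailing semicolon.

```python
from pyparsing import Keyword, Word, ZeroOrMore, alphas, alphanums

reserved_words = Keyword('JOIN') | Keyword('ON')
identifier = Word(alphas, alphanums + '_')
token = ~reserved_words + identifier

txt = """
JOIN zoo ON abcd JOIN
moo ON join6;"""

# As in the answer: keywords and identifiers are both returned, in order.
all_items = ZeroOrMore(reserved_words | token).parseString(txt).asList()

# Variation (assumption): suppress the keywords so only identifiers remain.
names_only = ZeroOrMore(reserved_words.suppress() | token).parseString(txt).asList()

print(all_items)
print(names_only)
```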

Also, Word accepts two arguments, the allowed initial characters and the allowed body characters, which simplifies your definition of identifier:

identifier = Word(alphas, alphanums + '_')
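
The same two-argument Word can be combined with the keyword lookahead and applied to the additional example in the question. The sketch below is one possible reading, not something stated in the answer: in the second line, Optional(identifier) appears able to consume ON as an ordinary identifier, after which on_kw has nothing left to match. Putting ~reserved_words in front of the identifier avoids that. Using identifier directly in place of "table" is a simplification made here.

```python
from pyparsing import Keyword, Optional, Word, alphas, alphanums

join_kw, on_kw, as_kw = [Keyword(w, caseless=True) for w in ('JOIN', 'ON', 'AS')]
reserved_words = join_kw | on_kw | as_kw

# Two-argument Word: first character from alphas, the rest from alphanums + '_'.
# The ~reserved_words lookahead keeps keywords from being read as identifiers.
identifier = ~reserved_words + Word(alphas, alphanums + '_')

# Assumption: "table" is reduced to a plain identifier for this sketch.
stmt = join_kw + identifier + Optional(as_kw) + Optional(identifier) + on_kw

txt = """
JOIN zoo AS t ON abcd
JOIN moo ON join6;"""

matches = [tokens.asList() for tokens, start, end in stmt.scanString(txt)]

for m in matches:
    print(m)
```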

Licensed under: CC-BY-SA with attribution
