I need to parse a document for tokens. A token is a string of alphanumeric characters whose first character is alphabetic.
In the example below, I want to see the following tokens: zoo, abcd, moo, join6.
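To make the expected result concrete, here is a minimal sketch of the rule above using only the standard `re` module, not pyparsing. The names `RESERVED`, `WORD_RE` and `expected_tokens` are made up for illustration, and I am assuming underscores are allowed after the first character, as in my `identifier` definition.

```python
import re

# Hypothetical names, for illustration only; this is not my pyparsing code.
RESERVED = {"JOIN", "ON"}

# A whole word that starts with a letter, followed by letters, digits or "_".
WORD_RE = re.compile(r"\b[A-Za-z][A-Za-z0-9_]*")

def expected_tokens(text):
    """Return whole words that start with a letter and are not reserved."""
    return [w for w in WORD_RE.findall(text) if w not in RESERVED]

txt = """
JOIN zoo ON abcd JOIN
moo ON join6;"""

print(expected_tokens(txt))  # ['zoo', 'abcd', 'moo', 'join6']
```

The point is that a reserved word should be skipped as a whole word, never split into pieces such as `OIN` or `N`.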
My code looks like the following:
#!/usr/bin/env python
from pyparsing import *
reserved_words = ( Keyword('JOIN') | Keyword('ON'))
identifier = Combine(Word(alphas, exact=1) + Optional(Word(alphanums + '_')))
token = ~reserved_words + identifier
txt = """
JOIN zoo ON abcd JOIN
moo ON join6;"""
for token, start, end in token.scanString(txt):
    print token, start, end
The output that I see is:
['OIN'] 2 5
['zoo'] 5 9
['N'] 11 12
['abcd'] 12 17
['OIN'] 19 22
['moo'] 22 26
['N'] 28 29
['join6'] 29 35
I would appreciate any help.
Additional example:
I have to parse an SQL-like language that has keywords such as JOIN, ON, and AS. I changed the definition of "table" the way you suggested. Both the keyword AS and the alias identifier that follows it are optional. The second line of "txt" uses neither AS nor an alias. With the code below, only the first statement is reported (see the output after the code), and I don't understand why.
#!/usr/bin/env python
from pyparsing import *
join_kw , on_kw, as_kw = map(lambda x: Keyword(x, caseless=True), ['JOIN' , 'ON', 'AS'])
reserved_words = ( join_kw | on_kw | as_kw)
identifier = Combine(Word(alphas, exact=1) + Optional(Word(alphanums + '_')))
table = (reserved_words.suppress() | identifier)
stmt = join_kw + table + Optional(as_kw) + Optional(identifier) + on_kw
txt = """
JOIN zoo AS t ON abcd
JOIN moo ON join6;"""
for token, start, end in stmt.scanString(txt):
    if len(token) != 0:
        print token, start, end
['JOIN', 'zoo', 'AS', 't', 'ON'] 1 17
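For comparison, the result I am hoping for can be sketched with the standard `re` module instead of pyparsing. This is only an illustration of the intended statement shape, JOIN table [AS] [alias] ON. The names `NAME`, `STMT_RE` and `join_clauses` are hypothetical, and the lookahead that keeps a keyword from being read as a table or alias is my assumption about what the grammar should do.

```python
import re

# Hypothetical pattern, for illustration only: JOIN <table> [AS] [<alias>] ON
# The (?!...) lookahead stops a keyword from being taken as a table or alias.
NAME = r"(?!(?:JOIN|ON|AS)\b)[A-Za-z][A-Za-z0-9_]*"
STMT_RE = re.compile(
    r"\bJOIN\s+(" + NAME + r")(?:\s+AS\b)?(?:\s+(" + NAME + r"))?\s+ON\b",
    re.IGNORECASE,  # keywords are caseless, as in my Keyword(..., caseless=True)
)

def join_clauses(text):
    """Return (table, alias) pairs; alias is None when it is omitted."""
    return [(m.group(1), m.group(2)) for m in STMT_RE.finditer(text)]

txt = """
JOIN zoo AS t ON abcd
JOIN moo ON join6;"""

print(join_clauses(txt))  # [('zoo', 't'), ('moo', None)]
```

In other words, I expect both JOIN lines to match, with the second one simply having no alias.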