Question

I am trying to parse a few SQL statements. Here is a sample:

select
    ms.member_sk a,
    dd.date_sk b,
    st.subscription_type,
    (SELECT foo FROM zoo) e
from dim_member_subscription_all p,
     dim_subs_type
where a in (select moo from t10)

I am interested in getting tables only at this time. So I would like to see [zoo, dim_member_subscription_all, dim_subs_type] & [t10]

I have put together a small script based on Paul McGuire's example:

#!/usr/bin/env python
import sys
import pprint
from pyparsing import *


pp = pprint.PrettyPrinter(indent=4)
semicolon = Combine(Literal(';') + lineEnd)
comma = Literal(',')
lparen = Literal('(')
rparen = Literal(')')

update_kw, volatile_kw, create_kw, table_kw, as_kw, from_kw, \
where_kw, join_kw, left_kw, right_kw, cross_kw, outer_kw, \
on_kw , insert_kw , into_kw= \
    map(lambda x: Keyword(x, caseless=True), \
        ['UPDATE', 'VOLATILE', 'CREATE', 'TABLE', 'AS', 'FROM',
         'WHERE', 'JOIN' , 'LEFT', 'RIGHT' , \
         'CROSS', 'OUTER', 'ON', 'INSERT', 'INTO'])

select_kw = Keyword('SELECT', caseless=True) | Keyword('SEL' , caseless=True)

reserved_words = (update_kw | volatile_kw | create_kw | table_kw | as_kw |
                  select_kw | from_kw | where_kw | join_kw |
                  left_kw | right_kw | cross_kw | on_kw | insert_kw |
                  into_kw)

ident = ~reserved_words + Word(alphas, alphanums + '_')

table = Combine(Optional(ident + Literal('.')) + ident)
column = Combine(Optional(ident + Literal('.')) + (ident | Literal('*')))

column_alias = Optional(Optional(as_kw).suppress() + ident)
table_alias = Optional(Optional(as_kw).suppress() + ident).suppress()

select_stmt = Forward()
nested_table = lparen.suppress() + select_stmt + rparen.suppress() + table_alias
table_list = delimitedList((nested_table | table) + table_alias)
column_list = delimitedList((nested_table | column) + column_alias)

txt = """
select
       ms.member_sk a,
       dd.date_sk b,
       st.subscription_type,
       (SELECT foo FROM zoo) e
from dim_member_subscription_all p,
     dim_subs_type
where a in (select moo from t10)
"""

select_stmt << select_kw.suppress() + column_list + from_kw.suppress() +  \
               table_list.setResultsName('tables', listAllMatches=True)

print(txt)

for token in select_stmt.searchString(txt):
    pp.pprint(token.asDict())

I am getting the following nested output. Can anybody please help me understand what I am doing wrong?

{   'tables': ([(['zoo'], {}), (['dim_member_subscription_all', 'dim_subs_type'], {})], {})}
{   'tables': ([(['t10'], {})], {})}

Solution

searchString returns a list of all matching ParseResults. You can see the tables value of each one using:

for token in select_stmt.searchString(txt):
    print(token.tables)

Giving:

[['zoo'], ['dim_member_subscription_all', 'dim_subs_type']]
[['t10']]

So searchString found two SELECT statements.

Recent versions of pyparsing support summing this list into a single consolidated result using the Python builtin sum. Accessing the tables value of this consolidated result looks like this:

print(sum(select_stmt.searchString(txt)).tables)

[['zoo'], ['dim_member_subscription_all', 'dim_subs_type'], ['t10']]

I think the parser is doing all you want. You just need to figure out how to process the returned results.
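
If a flat list per statement is what you are after, as in the question, one way to process the results is to flatten the nested groups with itertools.chain. This is only a minimal sketch: it uses plain Python lists as stand-ins for the token.tables values shown above, and the helper name flatten_tables is made up for illustration (it is not part of pyparsing).

```python
from itertools import chain


def flatten_tables(groups):
    """Flatten one level of nesting: [['a'], ['b', 'c']] -> ['a', 'b', 'c']."""
    return list(chain.from_iterable(groups))


# Stand-ins for the token.tables values printed above, one entry per SELECT
# statement that searchString found (assumption: same shape as that output).
per_statement = [
    [['zoo'], ['dim_member_subscription_all', 'dim_subs_type']],
    [['t10']],
]

# One flat list of table names per statement.
flat = [flatten_tables(groups) for groups in per_statement]
print(flat)
```

With real results you would pass token.tables to the helper inside the searchString loop. ParseResults objects are iterable, so the same flattening should apply, but check it against the pyparsing version you have installed.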

For further debugging, start using the dump method on ParseResults to see what you are getting. It prints the nested list of returned tokens, followed by a hierarchical tree of all named results. For your example:

for token in select_stmt.searchString(txt):
    print(token.dump())
    print()

prints:

['ms.member_sk', 'a', 'dd.date_sk', 'b', 'st.subscription_type', 'foo', 'zoo', 'dim_member_subscription_all', 'dim_subs_type']
- tables: [['zoo'], ['dim_member_subscription_all', 'dim_subs_type']]

['moo', 't10']
- tables: [['t10']]
Licensed under: CC-BY-SA with attribution