Controlling Python PLY lexer states from parser

https://stackoverflow.com/questions/9848670

26-05-2021
|

سؤال

I am working on a simple SQL select like query parser and I need to be able to capture subqueries that can occur at certain places literally. I found lexer states are the best solution and was able to do a POC using curly braces to mark the start and end. However, the subqueries will be delimited by parenthesis, not curlys, and the parenthesis can occur at other places as well, so I can't being the state with every open-paren. This information is readily available with the parser, so I was hoping to call begin and end at appropriate locations in the parser rules. This however didn't work because lexer seem to tokenize the stream all at once, and so the tokens get generated in the INITIAL state. Is there a workaround for this problem? Here is an outline of what I tried to do:

def p_value_subquery(p):
    """
     value : start_sub end_sub
    """
    p[0] = "( " + p[1] + " )"

def p_start_sub(p):
    """
    start_sub : OPAR
    """
    start_subquery(p.lexer)
    p[0] = p[1]

def p_end_sub(p):
    """
    end_sub : CPAR
    """
    subquery = end_subquery(p.lexer)
    p[0] = subquery

The start_subquery() and end_subquery() are defined like this:

def start_subquery(lexer):
    lexer.code_start = lexer.lexpos        # Record the starting position
    lexer.level = 1
    lexer.begin('subquery') 

def end_subquery(lexer):
    value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
    lexer.lineno += value.count('\n')
    lexer.begin('INITIAL')
    return value

The lexer tokens are simply there to detect the close-paren:

@lex.TOKEN(r"\(")
def t_subquery_SUBQST(t):
    lexer.level += 1

@lex.TOKEN(r"\)")
def t_subquery_SUBQEN(t):
    lexer.level -= 1

@lex.TOKEN(r".")
def t_subquery_anychar(t):
    pass

I would appreciate any help.

المحلول 2

Based on PLY author's response, I came up with this better solution. I am yet to figure out how to return the subquery as a token, but the rest looks much better and need not be considered a hack anymore.

def start_subquery(lexer):
    lexer.code_start = lexer.lexpos        # Record the starting position
    lexer.level = 1
    lexer.begin("subquery")

def end_subquery(lexer):
    lexer.begin("INITIAL")

def get_subquery(lexer):
    value = lexer.lexdata[lexer.code_start:lexer.code_end-1]
    lexer.lineno += value.count('\n')
    return value

@lex.TOKEN(r"\(")
def t_subquery_OPAR(t):
    lexer.level += 1

@lex.TOKEN(r"\)")
def t_subquery_CPAR(t):
    lexer.level -= 1
    if lexer.level == 0:
        lexer.code_end = lexer.lexpos        # Record the ending position
        return t

@lex.TOKEN(r".")
def t_subquery_anychar(t):
    pass

def p_value_subquery(p):
    """
    value : check_subquery_start OPAR check_subquery_end CPAR
    """
    p[0] = "( " + get_subquery(p.lexer) + " )"

def p_check_subquery_start(p):
    """
    check_subquery_start : 
    """
    # Here last_token would be yacc's lookahead.
    if last_token.type == "OPAR":
        start_subquery(p.lexer)

def p_check_subquery_end(p):
    """
    check_subquery_end : 
    """
    # Here last_token would be yacc's lookahead.
    if last_token.type == "CPAR":
        end_subquery(p.lexer)

last_token = None

def p_error(p):
    global subquery_retry_pos
    if p is None:
        print >> sys.stderr, "ERROR: unexpected end of query"
    else:
        print >> sys.stderr, "ERROR: Skipping unrecognized token", p.type, "("+ \
                p.value+") at line:", p.lineno, "and column:", find_column(p.lexer.lexdata, p)
        # Just discard the token and tell the parser it's okay.
        yacc.errok()

def get_token():
    global last_token
    last_token = lexer.token()
    return last_token

def parse_query(input, debug=0):
    lexer.input(input)
    return parser.parse(input, tokenfunc=get_token, debug=0)

نصائح أخرى

This answer may only be partially helpful, but I would also suggest looking at section "6.11 Embedded Actions" of the PLY documentation (http://www.dabeaz.com/ply/ply.html). In a nutshell, it is possible to write grammar rules in which actions occur mid-rule. It would look something similar to this:

def p_somerule(p):
    '''somerule : A B possible_sub_query LBRACE sub_query RBRACE'''

def p_possible_sub_query(p):
    '''possible_sub_query :'''
    ...
    # Check if the last token read was LBRACE.   If so, flip lexer state
    # Sadly, it doesn't seem that the token is easily accessible. Would have to hack it
    if last_token == 'LBRACE':
        p.lexer.begin('SUBQUERY')

Regarding the behavior of the lexer, there is only one token of lookahead being used. So, in any particular grammar rule, at most only one extra token has been read already. If you're going to flip lexer states, you need to make sure that it happens before the token gets consumed by the parser, but before the parser asks to read the next incoming token.

Also, if possible, I would try to stay out of the yacc() error handling stack as far as a solution. There is way too much black-magic going on in error handling--the more you can avoid it, the better.

I'm a bit pressed for time at the moment, but this seems to be something that could be investigated for the next version of PLY. Will put it on my to-do list.

Since nobody has an answer, it bugged me to find a workaround, and here is an ugly hack using the error recovery and restart().

def start_subquery(lexer, pos):
    lexer.code_start = lexer.lexpos        # Record the starting position
    lexer.level = 1
    lexer.begin("subquery") 
    lexer.lexpos = pos

def end_subquery(lexer):
    value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
    lexer.lineno += value.count('\n')
    lexer.begin('INITIAL')
    return value

@lex.TOKEN(r"\(")
def t_subquery_SUBQST(t):
    lexer.level += 1

@lex.TOKEN(r"\)")
def t_subquery_SUBQEN(t):
    lexer.level -= 1
    if lexer.level == 0:
        t.type = "SUBQUERY"
        t.value = end_subquery(lexer)
        return t

@lex.TOKEN(r".")
def t_subquery_anychar(t):
    pass

# NOTE: Due to the nature of the ugly workaround, the CPAR gets dropped, which
# makes it look like there is an imbalance.
def p_value_subquery(p):
    """
     value : OPAR SUBQUERY
    """
    p[0] = "( " + p[2] + " )"

subquery_retry_pos = None

def p_error(p):
    global subquery_retry_pos
    if p is None:
        print >> sys.stderr, "ERROR: unexpected end of query"
    elif p.type == 'SELECT' and parser.symstack[-1].type == 'OPAR':
        lexer.input(lexer.lexdata)
        subquery_retry_pos = parser.symstack[-1].lexpos
        yacc.restart()
    else:
        print >> sys.stderr, "ERROR: Skipping unrecognized token", p.type, "("+ \
                p.value+") at line:", p.lineno, "and column:", find_column(p.lexer.lexdata, p)
        # Just discard the token and tell the parser it's okay.
        yacc.errok()

def get_token():
    global subquery_retry_pos
    token = lexer.token()
    if token and token.lexpos == subquery_retry_pos:
        start_subquery(lexer, lexer.lexpos)
        subquery_retry_pos = None
    return token

def parse_query(input, debug=0):
    lexer.input(inp)
    result = parser.parse(inp, tokenfunc=get_token, debug=0)

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow