Combina eficientemente múltiples expresiones regulares en Python

https://stackoverflow.com/questions/133886

02-07-2019
|

Pregunta

Los analizadores léxicos son bastante fáciles de escribir cuando tienes expresiones regulares. Hoy quería escribir un analizador general simple en Python, y se me ocurrió:

import re
import sys

class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        self.rules = []

        for regex, type in rules:
            self.rules.append((re.compile(regex), type))

        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile('\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf[self.pos:])

                if m:
                    self.pos += m.start()
                else:
                    return None

            for token_regex, token_type in self.rules:
                m = token_regex.match(self.buf[self.pos:])

                if m:
                    value = self.buf[self.pos + m.start():self.pos + m.end()]
                    tok = Token(token_type, value, self.pos)
                    self.pos += m.end()
                    return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while 1:
            tok = self.token()
            if tok is None: break
            yield tok


if __name__ == '__main__':
    rules = [
        ('\d+',             'NUMBER'),
        ('[a-zA-Z_]\w+',    'IDENTIFIER'),
        ('\+',              'PLUS'),
        ('\-',              'MINUS'),
        ('\*',              'MULTIPLY'),
        ('\/',              'DIVIDE'),
        ('\(',              'LP'),
        ('\)',              'RP'),
        ('=',               'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print tok
    except LexerError, err:
        print 'LexerError at position', err.pos

Funciona bien, pero me preocupa un poco que sea demasiado ineficiente. ¿Hay algún truco regex que me permita escribirlo de una manera más eficiente / elegante?

Específicamente, ¿hay alguna manera de evitar recorrer todas las reglas de expresión regular linealmente para encontrar una que se ajuste?

Solución

Puede combinar todas sus expresiones regulares en una utilizando " | " operador y deje que la biblioteca de expresiones regulares haga el trabajo de discernir entre tokens. Se debe tener cuidado para garantizar la preferencia de los tokens (por ejemplo, para evitar que una palabra clave coincida como identificador).

Otros consejos

Sugiero usar la clase re.Scanner, no está documentada en la biblioteca estándar, pero vale la pena usarla. Aquí hay un ejemplo:

import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
    (r"-?[0-9]+", lambda scanner, token: int(token)),
    (r" +", lambda scanner, token: None),
])

>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]

Encontré esto en el documento de Python. Es simple y elegante.

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+*\/\-]'),    # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start()-line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

El truco aquí es la línea:

tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

Aquí (?P<ID>PATTERN) marcará el resultado coincidente con un nombre especificado por ID.

re.match está anclado. Puede darle un argumento de posición:

pos = 0
end = len(text)
while pos < end:
    match = regexp.match(text, pos)
    # do something with your match
    pos = match.end()

Eche un vistazo a los pigmentos que envían una gran cantidad de lexers para resaltar la sintaxis con diferentes implementaciones, la mayoría basadas en expresiones regulares.

Es posible que la combinación de expresiones regulares de token funcione, pero tendría que compararla. Algo así como:

x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
a = x.match('9999').groupdict() # => {'VAR': None, 'NUMBER': '9999'}
if a:
    token = [a for a in a.items() if a[1] != None][0]

El filtro es donde tendrás que hacer algunos benchmarking ...

Actualización: probé esto, y parece que combina todos los tokens como se indica y escribe una función como:

def find_token(lst):
    for tok in lst:
        if tok[1] != None: return tok
    raise Exception

Obtendrá aproximadamente la misma velocidad (tal vez un poco más rápido) para esto. Creo que la aceleración debe estar en el número de llamadas para que coincida, pero el ciclo para la discriminación de tokens sigue ahí, lo que por supuesto lo mata.

Esta no es exactamente una respuesta directa a su pregunta, pero es posible que desee mirar ANTLR . De acuerdo con este documento, el objetivo de generación de código de Python debería estar activo hasta la fecha.

En cuanto a sus expresiones regulares, realmente hay dos formas de acelerarlo si se apega a las expresiones regulares. El primero sería ordenar sus expresiones regulares en el orden de la probabilidad de encontrarlas en un texto predeterminado. Podría imaginar agregar un generador de perfiles simple al código que recopiló el recuento de tokens para cada tipo de token y ejecutar el lexer en un cuerpo de trabajo. La otra solución sería ordenar los expresiones regulares (ya que su espacio clave, al ser un carácter, es relativamente pequeño) y luego usar una matriz o diccionario para realizar las expresiones regulares necesarias después de realizar una sola discriminación en el primer carácter.

Sin embargo, creo que si vas a seguir esta ruta, deberías probar algo como ANTLR que será más fácil de mantener, más rápido y menos propenso a tener errores.

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow