Abbina in modo efficiente più regex in Python

https://stackoverflow.com/questions/133886

02-07-2019
|

Domanda

Gli analizzatori lessicali sono abbastanza facili da scrivere quando si hanno regex. Oggi volevo scrivere un semplice analizzatore generale in Python e ho trovato:

import re
import sys

class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        self.rules = []

        for regex, type in rules:
            self.rules.append((re.compile(regex), type))

        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile('\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf[self.pos:])

                if m:
                    self.pos += m.start()
                else:
                    return None

            for token_regex, token_type in self.rules:
                m = token_regex.match(self.buf[self.pos:])

                if m:
                    value = self.buf[self.pos + m.start():self.pos + m.end()]
                    tok = Token(token_type, value, self.pos)
                    self.pos += m.end()
                    return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while 1:
            tok = self.token()
            if tok is None: break
            yield tok


if __name__ == '__main__':
    rules = [
        ('\d+',             'NUMBER'),
        ('[a-zA-Z_]\w+',    'IDENTIFIER'),
        ('\+',              'PLUS'),
        ('\-',              'MINUS'),
        ('\*',              'MULTIPLY'),
        ('\/',              'DIVIDE'),
        ('\(',              'LP'),
        ('\)',              'RP'),
        ('=',               'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print tok
    except LexerError, err:
        print 'LexerError at position', err.pos

Funziona bene, ma sono un po 'preoccupato che sia troppo inefficiente. Ci sono dei trucchi regex che mi permetteranno di scriverlo in un modo più efficiente / elegante?

In particolare, c'è un modo per evitare di scorrere in modo lineare tutte le regole regex per trovarne una adatta?

Soluzione

Puoi unire tutti i tuoi regex in uno usando il " | " operatore e lascia che la libreria regex faccia il lavoro di discernimento tra i token. È necessario prestare attenzione per garantire la preferenza dei token (ad esempio per evitare di abbinare una parola chiave come identificatore).

Altri suggerimenti

Suggerisco di usare la classe re.Scanner, non è documentata nella libreria standard, ma vale la pena usarla. Ecco un esempio:

import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
    (r"-?[0-9]+", lambda scanner, token: int(token)),
    (r" +", lambda scanner, token: None),
])

>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]

Ho trovato questo nel documento Python. È semplice ed elegante.

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+*\/\-]'),    # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start()-line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

Il trucco qui è la linea:

tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

Qui (?P<ID>PATTERN) contrassegnerà il risultato corrispondente con un nome specificato da ID.

re.match è ancorato. Puoi dargli un argomento di posizione:

pos = 0
end = len(text)
while pos < end:
    match = regexp.match(text, pos)
    # do something with your match
    pos = match.end()

Cerca i pigmenti che spediscono un carico di lexer per evidenziare la sintassi con diverse implementazioni, la maggior parte basate su espressioni regolari.

È possibile che la combinazione delle regex token funzionerà, ma dovresti confrontarla. Qualcosa del tipo:

x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
a = x.match('9999').groupdict() # => {'VAR': None, 'NUMBER': '9999'}
if a:
    token = [a for a in a.items() if a[1] != None][0]

Il filtro è dove dovrai fare dei benchmark ...

Aggiornamento: l'ho testato e sembra che se si combinano tutti i token come indicato e si scrive una funzione come:

def find_token(lst):
    for tok in lst:
        if tok[1] != None: return tok
    raise Exception

Otterrai all'incirca la stessa velocità (forse un adolescente più veloce) per questo. Credo che l'accelerazione debba essere nel numero di chiamate da abbinare, ma il ciclo per la discriminazione dei token è ancora lì, il che ovviamente lo uccide.

Questa non è esattamente una risposta diretta alla tua domanda, ma potresti voler guardare ANTLR . Secondo questo il target di generazione del codice Python dovrebbe essere alto ad oggi.

Per quanto riguarda le tue regex, ci sono davvero due modi per accelerarla se stai aderendo alle regex. Il primo sarebbe quello di ordinare le tue regex nell'ordine della probabilità di trovarle in un testo predefinito. Potresti immaginare di aggiungere un semplice profiler al codice che ha raccolto i conteggi dei token per ciascun tipo di token e di eseguire il lexer su un corpo di lavoro. L'altra soluzione sarebbe quella di riordinare i tuoi regex (dal momento che il tuo spazio chiave, essendo un personaggio, è relativamente piccolo) e quindi utilizzare un array o un dizionario per eseguire i regex necessari dopo aver eseguito una singola discriminazione sul primo carattere.

Tuttavia, penso che se hai intenzione di seguire questa strada, dovresti davvero provare qualcosa come ANTLR che sarà più facile da mantenere, più veloce e con meno probabilità di avere bug.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow