Efficiently match multiple regexes in Python

https://stackoverflow.com/questions/133886

02-07-2019

Question

Lexical analyzers are quite easy to write when you have regexes. Today I wanted to write a simple general-purpose analyzer in Python, and came up with the following:

import re
import sys

class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        self.rules = []

        for regex, type in rules:
            self.rules.append((re.compile(regex), type))

        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile(r'\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf[self.pos:])

                if m:
                    self.pos += m.start()
                else:
                    return None

            for token_regex, token_type in self.rules:
                m = token_regex.match(self.buf[self.pos:])

                if m:
                    value = self.buf[self.pos + m.start():self.pos + m.end()]
                    tok = Token(token_type, value, self.pos)
                    self.pos += m.end()
                    return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while 1:
            tok = self.token()
            if tok is None: break
            yield tok


if __name__ == '__main__':
    rules = [
        (r'\d+',             'NUMBER'),
        (r'[a-zA-Z_]\w*',    'IDENTIFIER'),  # \w* so single-character names match too
        (r'\+',              'PLUS'),
        (r'\-',              'MINUS'),
        (r'\*',              'MULTIPLY'),
        (r'\/',              'DIVIDE'),
        (r'\(',              'LP'),
        (r'\)',              'RP'),
        (r'=',               'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print(tok)
    except LexerError as err:
        print('LexerError at position', err.pos)

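It works just fine, but I'm a bit worried that it's too inefficient. Are there any regex tricks that would let me write it in a more efficient and elegant way?

Specifically, is there a way to avoid looping over all the regex rules linearly to find one that fits?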

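Solution

You can merge all your regexes into one using the "|" operator and let the regex library do the work of discerning between tokens. Some care should be taken to ensure the preference of tokens (for example, to avoid matching a keyword as an identifier).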
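As a rough sketch of that idea (the rule names and the `if` keyword are hypothetical, not taken from the question), each rule can be wrapped in a named group and joined with "|". Python's alternation tries the alternatives left to right and takes the first one that matches, not the longest, so the order of the list is what encodes token preference:

```python
import re

# Hypothetical rule set; earlier alternatives win, so order encodes priority.
RULES = [
    ("IF",         r"if\b"),          # keyword listed before the identifier rule
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[a-zA-Z_]\w*"),
    ("PLUS",       r"\+"),
    ("WS",         r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % rule for rule in RULES))

def lex(text):
    """Yield (type, value) pairs using a single combined pattern."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("no rule matches at position %d" % pos)
        if m.lastgroup != "WS":        # whitespace is matched but not reported
            yield m.lastgroup, m.group()
        pos = m.end()

print(list(lex("if ifx + 12")))
# [('IF', 'if'), ('IDENTIFIER', 'ifx'), ('PLUS', '+'), ('NUMBER', '12')]
```

The `\b` on the keyword rule is what keeps `ifx` from being split into `if` and `x`. Without it, the keyword alternative would win on the first two characters.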

Other answers

I suggest using the re.Scanner class. It is not documented in the standard library documentation, but it is well worth using. Here is an example:

import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
    (r"-?[0-9]+", lambda scanner, token: int(token)),
    (r" +", lambda scanner, token: None),
])

>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]
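For the kind of input in the question, a sketch along these lines may work. The token names are illustrative, and because re.Scanner is undocumented, its behaviour here (a `None` action drops the match, and `scan` returns the results plus the unmatched remainder) is based on the current CPython implementation rather than a documented contract:

```python
import re

# Each lexicon entry is (pattern, action); the action receives the scanner
# and the matched text. An action of None means "match, but emit nothing".
scanner = re.Scanner([
    (r"[0-9]+",        lambda s, tok: ("NUMBER", tok)),
    (r"[a-zA-Z_]\w*",  lambda s, tok: ("IDENTIFIER", tok)),
    (r"[+\-*/()=]",    lambda s, tok: ("OP", tok)),
    (r"\s+",           None),
])

# scan() returns (list_of_results, unmatched_remainder).
tokens, remainder = scanner.scan("erw = _abc + 12*(R4-623902)  ")
print(tokens)
print(repr(remainder))   # an empty string means the whole input was consumed
```

A non-empty remainder is the scanner's way of reporting a lexing error: scanning stops at the first position where no pattern matches.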

I found this in the Python documentation (the tokenizer example in the re module docs). It is simple and elegant.

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+*\/\-]'),    # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start()-line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The trick here is this line:

tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

Here (?P<ID>PATTERN) marks the matched result with the name given by ID.
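To make the mechanism concrete, here is a small sketch with a two-rule specification (shortened from the one above purely for illustration). It shows the pattern string that the join produces and how the match object reports which named group matched:

```python
import re

token_specification = [
    ('NUMBER', r'\d+'),
    ('ID',     r'[A-Za-z]+'),
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
print(tok_regex)            # (?P<NUMBER>\d+)|(?P<ID>[A-Za-z]+)

mo = re.match(tok_regex, 'abc')
print(mo.lastgroup)         # ID   (name of the group that matched)
print(mo.group('ID'))       # abc
print(mo.group('NUMBER'))   # None (this alternative did not take part)
```

Since `lastgroup` already names the alternative that matched, no loop over the rules is needed to work out the token type.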

re.match is anchored. A compiled pattern's match() method also accepts a position argument, so there is no need to slice the buffer:

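pos = 0
end = len(text)
while pos < end:
    # regexp is a compiled pattern (re.compile(...)); text is the input string
    match = regexp.match(text, pos)
    if match is None:
        raise ValueError('no match at position %d' % pos)
    # do something with your match
    pos = match.end()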
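Applied to the lexer from the question, the same idea might look like the sketch below. `PosLexer` is a hypothetical, trimmed-down variant (tokens are plain tuples, and whitespace is always skipped), not the original class; the point is only that `match(buf, pos)` replaces the `buf[pos:]` slices, which copy the rest of the buffer on every attempt:

```python
import re

class PosLexer:
    """Sketch of the question's lexer using match(buf, pos) instead of slices."""

    def __init__(self, rules):
        self.rules = [(re.compile(rx), typ) for rx, typ in rules]
        self.ws = re.compile(r"\s*")   # always matches, possibly zero characters

    def tokens(self, buf):
        pos = self.ws.match(buf, 0).end()          # skip leading whitespace
        while pos < len(buf):
            for rx, typ in self.rules:
                m = rx.match(buf, pos)             # anchored at pos, no copy
                if m:
                    yield typ, m.group(), pos
                    pos = m.end()
                    break
            else:
                raise ValueError("no rule matches at position %d" % pos)
            pos = self.ws.match(buf, pos).end()    # skip whitespace after token

lx = PosLexer([(r"\d+", "NUMBER"), (r"\+", "PLUS")])
print(list(lx.tokens("1 + 23 ")))
# [('NUMBER', '1', 0), ('PLUS', '+', 2), ('NUMBER', '23', 4)]
```

This sketch assumes no rule can match the empty string; a zero-length match would leave `pos` unchanged and loop forever.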

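Have a look at Pygments, which ships a shitload of lexers for syntax highlighting purposes with different implementations, most of them based on regular expressions.

It's possible that combining the token regexes will work, but you'd have to benchmark it. Something like: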

import re

x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
m = x.match('9999')
if m:  # match() returns None when nothing matches
    a = m.groupdict()  # => {'NUMBER': '9999', 'VAR': None}
    token = [item for item in a.items() if item[1] is not None][0]

The filter is where you'd have to do some benchmarking...

Update: I tested this, and it seems that if you combine all the tokens as stated and write a function like:

def find_token(lst):
    for tok in lst:
        if tok[1] is not None:
            return tok
    raise Exception('no token matched')

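You'd get roughly the same speed (maybe a teensy bit faster) with this. I believe the speedup must come from the number of calls to match, but the loop for token discrimination is still there, which of course kills it.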
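One way to check this, sketched here with the standard timeit module, is to time the filtering loop against the match object's `lastgroup` attribute, which names the matching group directly and so avoids the discrimination loop. The function names are made up for the comparison, and no particular result is claimed; the numbers depend on the interpreter and on how many rules the pattern has:

```python
import re
import timeit

x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')

def via_filter(text):
    """Find the token by scanning groupdict() for the non-None entry."""
    m = x.match(text)
    for tok in m.groupdict().items():
        if tok[1] is not None:
            return tok

def via_lastgroup(text):
    """Let the match object name the group that matched."""
    m = x.match(text)
    return m.lastgroup, m.group()

# Both approaches should agree on the token before being timed.
assert via_filter('9999') == via_lastgroup('9999') == ('NUMBER', '9999')

for fn in (via_filter, via_lastgroup):
    elapsed = timeit.timeit(lambda: fn('9999'), number=100000)
    print('%-14s %.3f s' % (fn.__name__, elapsed))
```

With only two rules the difference is likely to be small; the filtering loop grows with the number of rules, while the `lastgroup` lookup does not.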

This isn't exactly a direct answer to your question, but you might want to look at ANTLR. According to this document, the Python code generation target should be up to date.

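As for your regexes, there are really two ways to speed things up if you are sticking with regexes. The first is to order your regexes by the probability of finding them in a typical text. You could add a simple profiler to the code that collects token counts for each token type, and run the lexer on a body of work. The other solution is to bucket sort your regexes (since your key space, being a single character, is relatively small) and then use an array or dictionary to run only the needed regexes after performing a single discrimination on the first character.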
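The second approach could be sketched roughly as follows. The rule table, including the set of characters each rule may start with, is written by hand here and is purely illustrative; the sketch also assumes ASCII input, since `\d` and `\w` match non-ASCII characters that the hand-written first-character sets do not cover:

```python
import re
import string

# Hypothetical rules: (type, regex, characters a match can start with).
RULES = [
    ("NUMBER",     r"\d+",          string.digits),
    ("IDENTIFIER", r"[a-zA-Z_]\w*", string.ascii_letters + "_"),
    ("PLUS",       r"\+",           "+"),
    ("MINUS",      r"-",            "-"),
]

# Bucket the compiled rules by first character: one dict lookup then
# replaces the linear scan over every rule.
DISPATCH = {}
for typ, rx, first_chars in RULES:
    compiled = re.compile(rx)
    for ch in first_chars:
        DISPATCH.setdefault(ch, []).append((typ, compiled))

def lex(text):
    pos = 0
    while pos < len(text):
        if text[pos].isspace():        # skip whitespace one character at a time
            pos += 1
            continue
        # Only the rules that can start with this character are tried.
        for typ, rx in DISPATCH.get(text[pos], ()):
            m = rx.match(text, pos)
            if m:
                yield typ, m.group()
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)

print(list(lex("a1 + 42-b")))
# [('IDENTIFIER', 'a1'), ('PLUS', '+'), ('NUMBER', '42'), ('MINUS', '-'), ('IDENTIFIER', 'b')]
```

When several rules share a first character, they stay in their original relative order inside the bucket, so rule priority is preserved.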

However, I think that if you are going to go this route, you should really try something like ANTLR, which will be easier to maintain, faster, and less likely to have bugs.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow