Question

Hey guys, I am trying to understand some concepts regarding lexers. I understand that lexers are used in compilers to group the individual characters of a string into units known as tokens. But the thing that confuses me is the matching part. I do not understand the logic of why we need to match the characters to the corresponding position.

import sys
import re

def lex(characters, token_exprs):
    pos = 0
    tokens = []
    while pos < len(characters):
        match = None
        for token_expr in token_exprs:
            pattern, tag = token_expr
            regex = re.compile(pattern)
            match = regex.match(characters, pos)
            if match:
                text = match.group(0)
                if tag:
                    token = (text, tag)
                    tokens.append(token)
                break
        if not match:
            sys.stderr.write('Illegal character: %s\n' % characters[pos])
            sys.exit(1)
        else:
            pos = match.end(0)
    return tokens

This is the code that I do not completely understand. After the for loop, I do not quite grasp what the code is trying to do. Why do we have to match the characters to the position?


Solution

A pretty traditional lexer can work something like this:

  1. Get a character from somewhere, be it a file or a buffer
  2. Check what the current character is:
    • Is it a whitespace? Skip all whitespace
    • Is it a comment introduction character? Get and skip the comment
    • Is it a digit? Then try to get a number
    • Is it a "? Then try to get a string
    • Is it a letter? Then try to get an identifier
      • Is the identifier a keyword/reserved word?
    • Otherwise, is it a valid operator sequence?
  3. Return the token type
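
The steps above can be sketched as a small hand-written lexer. This is only a minimal illustration, not a definitive implementation: the keyword set, the operator set, the `#` comment syntax, and the names `next_token` and `tokenize` are all assumptions made up for the example.

```python
# Hypothetical language details, chosen only for illustration.
KEYWORDS = {"if", "else", "while"}
OPERATORS = {"==", "<=", ">=", "!=", "+", "-", "*", "/", "=", "<", ">", "(", ")"}


def next_token(text, pos):
    """Return (token_type, token_text, new_pos), or None at end of input."""
    n = len(text)
    # Skip whitespace and '#' comments (a comment runs to the end of the line).
    while pos < n:
        if text[pos].isspace():
            pos += 1
        elif text[pos] == "#":
            while pos < n and text[pos] != "\n":
                pos += 1
        else:
            break
    if pos >= n:
        return None

    start = pos
    ch = text[pos]
    if ch.isdigit():  # a digit starts a number
        while pos < n and text[pos].isdigit():
            pos += 1
        return ("NUMBER", text[start:pos], pos)
    if ch == '"':  # a double quote starts a string
        pos += 1
        while pos < n and text[pos] != '"':
            pos += 1
        if pos >= n:
            raise SyntaxError("unterminated string at index %d" % start)
        pos += 1  # consume the closing quote
        return ("STRING", text[start:pos], pos)
    if ch.isalpha() or ch == "_":  # a letter starts an identifier
        while pos < n and (text[pos].isalnum() or text[pos] == "_"):
            pos += 1
        word = text[start:pos]
        return ("KEYWORD" if word in KEYWORDS else "IDENT", word, pos)
    # Otherwise try the operators, longest first, so ">=" wins over ">".
    for op in sorted(OPERATORS, key=len, reverse=True):
        if text.startswith(op, pos):
            return ("OP", op, pos + len(op))
    raise SyntaxError("illegal character %r at index %d" % (ch, pos))


def tokenize(text):
    """Collect (token_type, token_text) pairs until the input is exhausted."""
    tokens, pos = [], 0
    while True:
        result = next_token(text, pos)
        if result is None:
            return tokens
        kind, value, pos = result
        tokens.append((kind, value))
```

Note that each call receives the current position and returns the new one, which is the same bookkeeping the `pos` variable does in the question's code.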

Instead of checking one character at a time, you can of course use regular expressions.
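
As a rough sketch of the regular-expression approach, all token patterns can be combined into one alternation with named groups, and the name of the group that matched gives the token type. The token names, the patterns, and the function name `tokenize_re` below are assumptions for illustration only.

```python
import re

# Hypothetical token set; order matters because the first alternative wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),      # whitespace, dropped from the output
    ("MISMATCH", r"."),    # any other character is an error
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize_re(text):
    """Return a list of (token_type, token_text) pairs."""
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup  # name of the named group that matched
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(
                "illegal character %r at index %d" % (m.group(), m.start())
            )
        tokens.append((kind, m.group()))
    return tokens
```

The catch-all `MISMATCH` pattern is a design choice: it ensures every character is accounted for by some match, so an illegal character is reported instead of being silently skipped.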


The best way to learn how a hand-written lexer works is (IMO) to find simple existing lexers and try to understand them.

Other tips

It doesn't match "characters to the position". The "pos" argument tells regex.match to ignore everything before index pos in the "characters" string and to try the pattern starting exactly at that index. So the code tries the given token patterns, in the given order, against the string. Once a token is found, pos is moved to the end of that match, so the next tokens are matched only against the remaining part of the string. It's not a lexer strictly speaking, as it does a bit more than a lexer should do (refer to Joachim Pileborg's answer or to the definition of a lexer).
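
A small demonstration of how the second argument of a compiled pattern's `match` behaves may help. The pattern and the sample string are made up for the example.

```python
import re

number = re.compile(r"[0-9]+")
source = "ab12cd"

# match() does not search ahead: the pattern must match starting at pos.
no_match = number.match(source, 0)  # 'a' is not a digit, so this is None
hit = number.match(source, 2)       # starts at index 2 and matches "12"

# end(0) is the index just past the match, which becomes the next pos.
next_pos = hit.end(0)
```

This is why the loop in the question sets `pos = match.end(0)`: each token starts where the previous one stopped.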

License: CC-BY-SA with attribution
Not affiliated with StackOverflow