Question

Hey guys, I am trying to understand some concepts regarding lexers. I understand that a lexer is used in a compiler to group the individual characters of a string into units known as tokens. What confuses me is the matching part: I do not understand why we need to match the characters at a particular position.

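import sys
import re

def lex(characters, token_exprs):
    pos = 0  # index of the next unread character
    tokens = []
    while pos < len(characters):
        match = None
        # Try each (pattern, tag) pair in order; the first one that matches wins.
        for token_expr in token_exprs:
            pattern, tag = token_expr
            regex = re.compile(pattern)
            # match() requires the pattern to match starting exactly at index pos.
            match = regex.match(characters, pos)
            if match:
                text = match.group(0)
                # A falsy tag (for example None) means the matched text is dropped,
                # typically whitespace or comments.
                if tag:
                    token = (text, tag)
                    tokens.append(token)
                break
        if not match:
            # No pattern matched at pos, so the input cannot be tokenized.
            sys.stderr.write('Illegal character: %s\n' % characters[pos])
            sys.exit(1)
        else:
            # Continue right after the text that was just consumed.
            pos = match.end(0)
    return tokens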

This is the code that I do not completely understand. I do not quite grasp what the code after the for loop is trying to do. Why do we have to match the characters to the position?


Solution

A pretty traditional lexer can work something like this:

  1. Get a character from somewhere, be it a file or a buffer
  2. Check what the current character is:
    • Is it a whitespace? Skip all whitespace
    • Is it a comment introduction character? Get and skip the comment
    • Is it a digit? Then try to get a number
    • Is it a double quote (")? Then try to get a string
    • Is it a letter? Then try to get an identifier
      • Is the identifier a keyword/reserved word?
    • Otherwise, is it a valid operator sequence?
  3. Return the token type
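
The steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions, not a reference implementation: the function name simple_lex, the keyword and operator sets, the "#" comment character, and the tag names are all made up for the example.

```python
def simple_lex(source):
    """Hand-written lexer sketch: returns a list of (text, tag) tuples."""
    keywords = {"if", "else", "while"}            # hypothetical keyword set
    two_char_ops = {":=", "<=", ">="}             # hypothetical operators
    one_char_ops = set("+-*/<>=()")               # hypothetical operators
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            # Skip a whole run of whitespace.
            while pos < len(source) and source[pos].isspace():
                pos += 1
        elif ch == "#":
            # Assumed comment syntax: "#" up to the end of the line.
            while pos < len(source) and source[pos] != "\n":
                pos += 1
        elif ch.isdigit():
            # Collect a run of digits as a number.
            start = pos
            while pos < len(source) and source[pos].isdigit():
                pos += 1
            tokens.append((source[start:pos], "NUMBER"))
        elif ch == '"':
            # Collect everything up to the closing quote (no escapes handled).
            end = source.find('"', pos + 1)
            if end == -1:
                raise SyntaxError("Unterminated string at position %d" % pos)
            tokens.append((source[pos:end + 1], "STRING"))
            pos = end + 1
        elif ch.isalpha() or ch == "_":
            # Collect an identifier, then check whether it is a keyword.
            start = pos
            while pos < len(source) and (source[pos].isalnum() or source[pos] == "_"):
                pos += 1
            text = source[start:pos]
            tokens.append((text, "KEYWORD" if text in keywords else "IDENTIFIER"))
        elif source[pos:pos + 2] in two_char_ops:
            # Prefer the longer operator, so "<=" is not split into "<" and "=".
            tokens.append((source[pos:pos + 2], "OPERATOR"))
            pos += 2
        elif ch in one_char_ops:
            tokens.append((ch, "OPERATOR"))
            pos += 1
        else:
            raise SyntaxError("Illegal character %r at position %d" % (ch, pos))
    return tokens


print(simple_lex('if x1 <= 42 # note\n  y := "hi"'))
```

Note that every branch advances pos past the text it consumed, which is the same role that pos = match.end(0) plays in the code from the question.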

Instead of checking single characters at a time, you can of course use regular expressions.
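
One possible way to do that in Python is to combine all token patterns into a single regular expression with named groups and let finditer walk through the input. This is a sketch under assumptions: TOKEN_SPEC, the tag names, and regex_lex are invented for the example, and the patterns cover only a tiny made-up language.

```python
import re

# Hypothetical token specification: (tag, pattern) pairs, tried in this order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r":=|[+\-*/()]"),
    ("SKIP", r"\s+"),      # whitespace, discarded
    ("MISMATCH", r"."),    # any other single character is an error
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def regex_lex(source):
    tokens = []
    for m in MASTER.finditer(source):
        tag = m.lastgroup  # name of the alternative that matched
        if tag == "SKIP":
            continue
        if tag == "MISMATCH":
            raise SyntaxError("Illegal character %r at position %d"
                              % (m.group(), m.start()))
        tokens.append((m.group(), tag))
    return tokens


print(regex_lex("x := 3 + y1"))
```

Here the regex engine keeps track of the current position itself; each match picks up where the previous one ended, and the catch-all MISMATCH pattern ensures that no character is silently skipped.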


The best way to learn how a hand-written lexer works is (IMO) to find simple existing lexers and try to understand them.

Other tips

It doesn't match "characters to the position". The "pos" argument makes the pattern apply only to the part of the "characters" string that runs from index pos to the end, and the match has to begin exactly at pos. So the code tries the given token patterns, in the given order, against the given string. After a token is found, the following tokens are matched only against the remaining part of the string, which is why pos is moved to match.end(0). Strictly speaking it's not a lexer, as it does a bit more than a lexer should do (refer to Joachim Pileborg's answer or to the definition of a lexer).
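
A small experiment may make the role of pos clearer. The sample string and the digit pattern below are made up for the illustration; only the match(string, pos) call is the same one the question's code uses.

```python
import re

text = "x = 42"
number = re.compile(r"[0-9]+")

at_start = number.match(text, 0)  # no match: the character at index 0 is "x"
at_four = number.match(text, 4)   # matches "42", which begins at index 4

print(at_start)                            # None
print(at_four.group(0), at_four.end(0))    # 42 6
```

The end(0) value is an index into the whole string, so assigning it to pos makes the next iteration start right after the token that was just read.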

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow