FSM-based Lexer
Writing a finite state machine (FSM) lexer by hand, you loop over the characters in the input and process each one with two levels of switch statements:
- the outer switch statement is switching on the state;
- the inner switch statement is switching on the character that was read.
This is an implementation of a finite state machine that processes the tokens. The downside is that it can become complex to maintain, especially when many states are needed to process each token. The upside is that it is easy to refactor into other finite-state-machine representations (e.g. using transition tables instead of switch statements).
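As a minimal sketch of the two-level dispatch (in Python, with if/elif chains standing in for the switch statements; the state names and token kinds are illustrative, not from any particular lexer):

```python
# Outer branch selects on the current state, inner branch on the
# character just read. Recognizes integers and identifiers only.
START, IN_INT, IN_IDENT = 0, 1, 2

def lex(text):
    tokens, state, start = [], START, 0
    for i, ch in enumerate(text + '\0'):        # sentinel ends the last token
        if state == START:
            if ch.isdigit():
                state, start = IN_INT, i
            elif ch.isalpha() or ch == '_':
                state, start = IN_IDENT, i
        elif state == IN_INT:
            if not ch.isdigit():                # integer token ends here
                tokens.append(('Integer', text[start:i]))
                state = START
        elif state == IN_IDENT:
            if not (ch.isalnum() or ch == '_'): # identifier token ends here
                tokens.append(('Identifier', text[start:i]))
                state = START
    return tokens
```

A complete lexer would also re-examine the character that ended a token when it can start the next one; the space-separated tokens this sketch handles sidestep that.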
Transition tables are equivalent to the switch statements: the rows are the states, the columns are the input characters (or character classes), and each cell gives the next state to transition to (with a sentinel such as -1 to signal that processing should stop). With this approach, the end state determines the token type: keep an array such as token_type tokens[N_STATES]; and, once the machine halts, look up token = tokens[current_state]; to get the token.
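A sketch of that table-driven form, with Python lists standing in for the C arrays (-1 is the stop sentinel; all names are illustrative):

```python
# Rows are states, columns are character classes, cells are the next
# state; -1 signals "stop, the token ends here".
DIGIT, LETTER, OTHER = 0, 1, 2            # character classes (columns)
START, IN_INT, IN_IDENT = 0, 1, 2         # states (rows)

table = [
    # DIGIT     LETTER     OTHER
    [IN_INT,   IN_IDENT,  -1],            # START
    [IN_INT,   -1,        -1],            # IN_INT: only digits continue it
    [IN_IDENT, IN_IDENT,  -1],            # IN_IDENT: letters/digits continue it
]

# The end state determines the token type, playing the role of the
# token_type tokens[N_STATES] array.
token_of_state = {IN_INT: 'Integer', IN_IDENT: 'Identifier'}

def char_class_of(ch):
    if ch.isdigit():
        return DIGIT
    if ch.isalpha() or ch == '_':
        return LETTER
    return OTHER

def next_token(text, pos):
    # Run the machine from text[pos:] until it signals -1 or input ends.
    state, start = START, pos
    while pos < len(text):
        nxt = table[state][char_class_of(text[pos])]
        if nxt == -1:
            break
        state, pos = nxt, pos + 1
    return token_of_state.get(state), text[start:pos], pos
```

In this sketch the caller is assumed to skip whitespace and punctuation between calls; a fuller table would give those their own columns and states.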
Non-FSM Lexer
The alternative approach is to switch on the first character and then read the rest of the characters in that token as part of that case statement. This can be easier to read and simpler to write.
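A sketch of that shape, where each branch plays the role of one case statement (the names are illustrative):

```python
def scan_token(text, pos):
    # Dispatch on the first character, then consume the rest of the
    # token inside that branch.
    ch = text[pos]
    if ch.isdigit():                      # integer: take all following digits
        end = pos + 1
        while end < len(text) and text[end].isdigit():
            end += 1
        return 'Integer', text[pos:end], end
    if ch.isalpha() or ch == '_':         # identifier: letters, digits, '_'
        end = pos + 1
        while end < len(text) and (text[end].isalnum() or text[end] == '_'):
            end += 1
        return 'Identifier', text[pos:end], end
    return 'Other', ch, pos + 1           # everything else: one-char token
```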
You can also split the characters into classes (e.g. digits, letters, minus, less than) defined by a 256-entry lookup table indexed by the character. This can simplify the case statements.
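One way to build such a table in Python (the class codes and membership are illustrative):

```python
# A 256-entry table mapping a byte value to its character class, so a
# branch can test one table lookup instead of several range checks.
DIGIT, LETTER, OTHER = 1, 2, 0            # illustrative class codes

char_class = [OTHER] * 256
for b in b'0123456789':
    char_class[b] = DIGIT
for b in b'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_':
    char_class[b] = LETTER

def classify(ch):
    # Characters outside the table (non-ASCII) fall back to OTHER here.
    code = ord(ch)
    return char_class[code] if code < 256 else OTHER
```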
Regular Expressions
Using one big regular expression is problematic because, as you noted, you cannot recover the token type from the match. An approach here is to keep a list of regular expressions, each paired with the token type it matches. For example, in Python:
import re

# token tags; any distinct values will do
WhiteSpace, Identifier, Integer = 'WhiteSpace', 'Identifier', 'Integer'

_tokens = [
    (re.compile(r'\s+'), WhiteSpace),
    (re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*'), Identifier),
    (re.compile(r'[0-9]+'), Integer),
]
You will need to order these correctly (e.g. placing keyword patterns before the identifier pattern, since every keyword also matches the identifier pattern).
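A sketch of the matching loop over such an ordered list (the patterns and token kinds are illustrative; None marks matches to discard):

```python
import re

# Keyword patterns sit before the identifier pattern so that 'if' is
# not lexed as an identifier; the \b anchors stop 'if' from matching
# inside a longer name like 'ifx'.
patterns = [
    (re.compile(r'\s+'),                    None),        # discard whitespace
    (re.compile(r'\b(if|else|while)\b'),    'Keyword'),
    (re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*'), 'Identifier'),
    (re.compile(r'[0-9]+'),                 'Integer'),
]

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        for pattern, kind in patterns:
            m = pattern.match(text, pos)   # anchored match at pos
            if m:
                if kind is not None:
                    out.append((kind, m.group()))
                pos = m.end()
                break
        else:                              # no pattern matched here
            raise SyntaxError(f'unexpected character {text[pos]!r}')
    return out
```

Pattern.match(text, pos) anchors each attempt at the current position, so the loop never scans ahead past unmatched input.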