How do you write a lexer parser where identifiers may begin with keywords?

Question

Most lexers, starting with the original lex, match alternatives as follows:

Use the longest match.
If there are two or more alternatives which tie for the longest match, use the first one in the lexer definition.

This allows the following style:

"case"                  { return CASE; }
[[:alpha:]][[:alnum:]]* { return ID; }

If the input pattern is caser, then the second alternative will be used because it's the longest match. If the input pattern is case r, then the first alternative will be used because both of them match case, and by rule (2) above, the first one wins.

Although this may seem a bit arbitrary, it's consistent with the DFA approach, mostly. First of all, a DFA doesn't stop the first time it reaches an accepting state. If it did, then patterns like [[:alpha:]][[:alnum:]]* would be useless, because they enter an accepting state on the first character (assuming its alphabetic). Instead, DFA-based lexers continue until there are no possible transitions from the current state, and then they back up until the last accepting state. (See below.)

A given DFA state may be accepting because of two different rules, but that's not a problem either; only the first accepting rule is recorded.

To be fair, this is slightly different from the mathematical model of a DFA, which has a transition for every symbol in every state (although many of them may be transitions to a "sink" state), and which matches a complete input depending on whether or not the automaton is in an accepting state when the last symbol of the input is read. The lexer model is slightly different, but can easily be formalized as well.

The only difficulty in the theoretical model is "back up to the last accepting state". In practice, this is generally done by remembering the state and input position every time an accepting state is reached. This does mean that it may be necessary to rewind the input stream, possibly by an arbitrary amount.

Most languages do not require backup very often, and very few require indefinite backup. Some lexer generators can generate faster code if there are no backup states. (flex will do this if you use -Cf or -CF.)

One common case which leads to indefinite backup is failing to provide an appropriate error return for string literals:

["][^"\n]*["]    { return STRING; }
 /* ... */
.                { return INVALID; }

Here, the first pattern will match a string literal starting with " if there is a matching " on the same line. (I omitted \-escapes for simplicity.) If the string literal is unterminated, the last pattern will match, but the input will need to be rewound to the ". In most cases, it's pointless trying to continue lexical analysis by ignoring an unmatched "; it would make more sense to just ignore the entire remainder of the line. So not only is backing up inefficient, it also is likely to lead to an explosion of false error messages. A better solution might be:

["][^"\n]*["]    { return STRING; } 
["][^"\n]*       { return INVALID_STRING; }

Here, the second alternative can only succeed if the string is unterminated, because if the string is terminated, the first alternative will match one more character. Consequently, it doesn't even matter in which order these alternatives appear, although everyone I know would put them in the same order I did.