Optimizing a word parser

Question 1

Aho-Corasick is a very cool algorithm but it's not ideal for keyword matches, because keyword matches are aligned; you can't have overlapping matches because you only match a complete identifier.

For the basic identifier lookup, you just need to build a trie out of your keywords (see note below).

Your basic algorithm is fine: find the beginning of the identifier, and then see if it's a keyword. It's important to improve both parts. Unless you need to deal with multibyte characters, the fastest way to find the beginning of a keyword is to use a 256-entry table, with one entry for each possible character. There are three possibilities:

The character can not appear in an identifier. (Continue the scan)
The character can appear in an identifier but no keyword starts with the character. (Skip the identifier)
The character can start a keyword. (Start walking the trie; if the walk cannot be continued, skip the identifier. If the walk finds a keyword and the next character cannot be in an identifier, skip the rest of the identifier; if it can be in an identifier, try continuing the walk if possible.)

Actually steps 2 and 3 are close enough together that you don't really need special logic.

There is some imprecision with the above algorithm because there are many cases where you find something that looks like an identifier but which syntactically cannot be. The most common cases are comments and quoted strings, but most languages have other possibilities. For example, in C you can have hexadecimal floating point numbers; while no C keyword can be constructed just from [a-f], a user-supplied word might be:

0x1.deadbeef

On the other hand, C++ allows user-defined numeric suffixes, which you might well want to recognize as keywords if the user adds them to the list:

274_myType

Beyond all of the above, it's really impractical to parse a million lines of code every time the user types a character in an editor. You need to develop some way of caching tokenization, and the simplest and most common one is to cache by input line. Keep the input lines in a linked list, and with every input line also record the tokenizer state at the beginning of the line (i.e., whether you're in a multi-line quoted string; a multi-line comment, or some other special lexical state). Except in some very bizarre languages, edits cannot affect the token structure of lines preceding the edit, so for any edit you only need to retokenize the edited line and any subsequent lines whose tokenizer state has changed. (Beware of working too hard in the case of multi-line strings: it can create lots of visual noise to flip the entire display because the user types an unterminated quote.)

Note: For smallish (hundreds) numbers of keywords, a full trie doesn't really take up that much space, but at some point you need to deal with bloated branches. One very reasonable datastructure, which can be made to perform very well if you're careful about data layout, is a ternary search tree (although I'd call it a ternary search trie.)

Question 2

It will be hard to beat this code.

Suppose your keywords are "a", "ax", and "foo".

Take the list of keywords, sorted, and feed it into a program that prints out code like this:

switch(pc[0]){
  break; case 'a':{
    if (0){
    } else if (strcmp(pc, "a")==0 && !alphanum(pc[1])){
      // push "a"
      pc += 1;
    } else if (strcmp(pc, "ax")==0 && !alphanum(pc[2])){
      // push "ax"
      pc += 2;
    }
  }
  break; case 'f':{
    if (0){
    } else if (strcmp(pc, "foo")==0 && !alphanum(pc[3])){
      // push "foo"
      pc += 3;
    }
    // etc. etc.
  }
  // etc. etc.
}

Then if you don't see a keyword, just increment pc and try again. The point is, by dispatching on the first character, you quickly get into the subset of keywords starting with that character. You might even want to go to two levels of dispatch.

Of course, as always, take some stack samples to see what the time is being used for. Regardless, if you have data structure classes, you're going to find that consuming a large part of your time, so keep that to a minimum (throw religion to the wind :)

Question 3

The fastest way to do it would be a finite state machine built to the word set. Use Lex to build the FSM.

Question 4

The best algorithm for this problem is probably Aho-Corasick. There already exist C implementations, e.g.,

http://sourceforge.net/projects/multifast/