Question

I'm wondering how to effectively test a lexer (tokenizer).

The number of possible token combinations in a source file is huge, and the only approach I've found is to write a batch of representative source files and expect a specific sequence of tokens for each of them.


Solution

Your grammar probably has rules for each token describing how it can be produced (for example, that a { signifies a BLOCK_START token, or that a string-literal token is delimited by '"' characters). Start by writing tests for those rules and verify that your lexer produces the correct token in each case.
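As a concrete illustration, here is a minimal regex-based lexer sketch with one test per lexical rule. The token names and rules are assumptions for the example, not taken from any particular grammar:

```python
import re

# Hypothetical token rules; each (name, pattern) pair is one lexical rule.
TOKEN_SPEC = [
    ("BLOCK_START", r"\{"),
    ("BLOCK_END", r"\}"),
    ("STRING_LITERAL", r'"[^"]*"'),
    ("INTEGER_LITERAL", r"\d+"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    """Tokenize src, skipping whitespace; raise SyntaxError on unknown input."""
    tokens, pos = [], 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if not m:
            raise SyntaxError(f"unrecognized character at {pos}: {src[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# One test per rule:
assert lex("{") == [("BLOCK_START", "{")]
assert lex("}") == [("BLOCK_END", "}")]
assert lex('"hi"') == [("STRING_LITERAL", '"hi"')]
assert lex("42") == [("INTEGER_LITERAL", "42")]
```

Each assertion exercises exactly one rule, so a failure points directly at the rule that is broken.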

Once you have a test for each single token, you can add a few tests for interesting combinations of tokens. Focus here on token combinations that would reveal an error in your lexer. The token combinations don't have to make sense to a parser for your language, so it is entirely valid to use +++++12 as input and expect the tokens INCREMENT, INCREMENT, PLUS, INTEGER_LITERAL(12) as output.
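A sketch of such a combination test, using the +++++12 example from above (the tiny lexer here is a stand-in; note that INCREMENT must be tried before PLUS so the lexer applies maximal munch):

```python
import re

# INCREMENT listed before PLUS so "++" wins over two "+" tokens.
spec = [("INCREMENT", r"\+\+"), ("PLUS", r"\+"), ("INTEGER_LITERAL", r"\d+")]
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in spec))

def lex(src):
    out, pos = [], 0
    while pos < len(src):
        m = master.match(src, pos)
        if not m:
            raise SyntaxError(f"unrecognized input at {pos}")
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

# The input need not make sense to the parser:
assert lex("+++++12") == [
    ("INCREMENT", "++"),
    ("INCREMENT", "++"),
    ("PLUS", "+"),
    ("INTEGER_LITERAL", "12"),
]
```

Swapping the order of INCREMENT and PLUS in the spec would make this test fail, which is exactly the kind of lexer error such combination tests are meant to reveal.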

And finally, make sure you have some tests for faulty inputs, where the lexer will not be able to recognize a token. And although I mention them last, they certainly don't have to be the last tests you create. You could just as well start with these.
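A sketch of a faulty-input test, assuming the lexer raises an exception when no rule matches (here a hypothetical SyntaxError, checked with unittest's assertRaises):

```python
import re
import unittest

# Minimal stand-in lexer that only knows integers and whitespace.
master = re.compile(r"(?P<INTEGER_LITERAL>\d+)|(?P<WS>\s+)")

def lex(src):
    out, pos = [], 0
    while pos < len(src):
        m = master.match(src, pos)
        if not m:
            raise SyntaxError(f"unrecognized input at {pos}: {src[pos]!r}")
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

class FaultyInputTests(unittest.TestCase):
    def test_unknown_character_is_rejected(self):
        # '@' has no lexical rule, so the lexer must fail loudly.
        with self.assertRaises(SyntaxError):
            lex("12 @ 34")

# Run the test case directly so the sketch is executable as a script:
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(FaultyInputTests)
)
assert result.wasSuccessful()
```

The important design choice is that the lexer fails with a position and the offending character, so error tests can also assert on the quality of the diagnostic, not just that an error occurred.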

Other tips

  • If you're writing the lexer yourself, this seems like an ideal case for test-driven development.

    While “the number of combinations of tokens in a source file can be huge,” the number of branches in your source code is finite. The idea is that before you add a feature in your code—for instance an edge case to handle by the lexer—you start by writing the test first.

  • If you're using an existing lexer that you feed with specific rules, a similar approach may apply: start with a very simple syntax (which doesn't do anything useful), then add more and more tests while making the rules progressively more complex.

One alternative that others aren't mentioning is to use a generative testing approach, like QuickCheck from Haskell, to generate the edge cases from the grammar you've defined.

Now, once they're generated, you would then handwrite some additional conditions that you would expect to fail (e.g., assertRaises).
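QuickCheck itself is Haskell; in Python the same generative idea can be sketched with the standard library's random module (libraries like Hypothesis do this more thoroughly). The generator below produces random inputs from a hypothetical "integers separated by +" rule together with the token sequence it expects, so the tests update automatically if the generator's grammar changes:

```python
import random
import re

# Stand-in lexer for the generated inputs.
master = re.compile(r"(?P<INTEGER_LITERAL>\d+)|(?P<PLUS>\+)|(?P<WS>\s+)")

def lex(src):
    out, pos = [], 0
    while pos < len(src):
        m = master.match(src, pos)
        if not m:
            raise SyntaxError(f"unrecognized input at {pos}")
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

def random_expression(rng):
    """Generate '<int> + <int> + ...' plus the token sequence we expect for it."""
    nums = [rng.randint(0, 10**6) for _ in range(rng.randint(1, 5))]
    src = " + ".join(str(n) for n in nums)
    expected = []
    for i, n in enumerate(nums):
        if i:
            expected.append(("PLUS", "+"))
        expected.append(("INTEGER_LITERAL", str(n)))
    return src, expected

rng = random.Random(0)  # seeded so failures are reproducible
for _ in range(100):
    src, expected = random_expression(rng)
    assert lex(src) == expected
```

On top of these generated cases you would then hand-write the failure-expecting tests (the assertRaises conditions mentioned above), which a grammar-driven generator cannot produce for you.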

This would have the advantage of updating automatically as your grammar changes, reducing the time spent maintaining tests. It would also have a fun meta side effect: you would be maintaining tests for your test maintainer ;)

Licensed under: CC-BY-SA with attribution