Question

A general-purpose documentation system that extracts inline documentation from multiple languages needs a parser for each language, and therefore a parser generator (which doesn't actually have to be very complete or efficient).

http://antlr.org/ is a nice parser generator that already has a number of grammars for popular languages. Are there better alternatives, i.e. simpler ones that support generating parsers for even more languages out of the box?


Solution

If you're only looking for "partial parsing", you could use ANTLR's option to lex only part of a token stream and ignore the rest of the input. You do that by setting filter=true in the options of a lexer grammar. The lexer then tries to match any of the tokens defined in your grammar; when none of them matches at the current position, it skips a single character and tries again at the next one:

lexer grammar Foo;

options {filter=true;}

StringLiteral
  :  ...
  ;

CharLiteral
  :  ...
  ;

SingleLineComment
  :  ...
  ;

MultiLineComment
  :  ...
  ;

When implemented properly, you can get the MultiLineComments (/* ... */) from a Java file quite easily, without worrying about single-line comments or string and char literals messing things up.

Obviously, your source files need to be valid for the lexer to tokenize them properly; otherwise you get strange results!
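To make the skip-one-character behaviour concrete, here is a minimal sketch of the same idea in plain Python rather than ANTLR. The four regular expressions are hypothetical stand-ins for the rules elided with "..." in the grammar above, and the function names are made up for illustration; this is not what ANTLR generates.

```python
import re

# (name, pattern) pairs: assumed approximations of the elided lexer rules.
TOKEN_SPECS = [
    ("MultiLineComment", r"/\*.*?\*/"),
    ("SingleLineComment", r"//[^\n]*"),
    ("StringLiteral", r'"(?:\\.|[^"\\\n])*"'),
    ("CharLiteral", r"'(?:\\.|[^'\\\n])+'"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPECS),
    re.DOTALL,
)


def filter_lex(source):
    """Yield (token_name, text) pairs, silently skipping unmatched characters."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match:
            yield match.lastgroup, match.group()
            pos = match.end()
        else:
            pos += 1  # no token starts here: drop one character and retry


def multi_line_comments(source):
    """Return only the /* ... */ comments found by filter_lex."""
    return [text for name, text in filter_lex(source) if name == "MultiLineComment"]
```

Because string literals and single-line comments are consumed as whole tokens, a "/*" that appears inside them never starts a comment match, which is the point of defining those rules even though only the comments are wanted.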

OTHER TIPS

My compiler uses Dypgen. This is a user-extensible GLR parser with lots of enrichments, so it can parse many languages. The bootstrap grammar is EBNF-like (it supports *, + and ? directly in your productions). It is powerful enough to load extensions dynamically, a fact my compiler leverages: the bulk of my programming language has its syntax loaded dynamically at compiler startup.

Dypgen is written in OCaml and generates OCaml code.

There is a C++ GLR parser called Elkhound which is powerful enough to parse most of C++.

However, for your actual requirements, you do not really need to do any serious parsing: a regular expression matching engine is probably good enough. Google's RE2 may be suitable (it provides most PCRE functionality, is a lot faster, and has a C++ interface).

Although this is less accurate, it is good enough because you can demand that inline documentation adhere to some simple formats. Most existing inline docs already do so for just this reason.
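As a rough illustration of the regex-only approach, the sketch below uses Python's built-in re module instead of RE2, with one pattern per language. The pattern table and the extract_docs name are assumptions made for this example, and the doc-comment conventions shown are only the common ones.

```python
import re

# Hypothetical per-language doc-comment conventions; a real tool would tune these.
DOC_PATTERNS = {
    "java": re.compile(r"/\*\*(.*?)\*/", re.DOTALL),     # /** ... */
    "python": re.compile(r'"""(.*?)"""', re.DOTALL),     # """ ... """
    "ocaml": re.compile(r"\(\*\*(.*?)\*\)", re.DOTALL),  # (** ... *)
}


def extract_docs(source, language):
    """Return the stripped body of every doc comment matched for `language`."""
    pattern = DOC_PATTERNS[language]
    return [body.strip() for body in pattern.findall(source)]
```

This will be fooled by doc-like text inside string literals, which is the loss of accuracy mentioned above; requiring a fixed doc-comment format keeps such cases rare.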

Where I work, we used to use GOLD Parser. It is a lot simpler than ANTLR and supports multiple languages. We have since moved to ANTLR, however, because we needed to do more complex parsing, which we found ANTLR handled better than GOLD.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow