Question

I'm creating a compiler with Lex and YACC (actually Flex and Bison). The language allows unlimited forward references to any symbol (like C#). The problem is that it's impossible to parse the language without knowing what an identifier is.

The only solution I know of is to lex the entire source, and then do a "breadth-first" parse, so higher level things like class declarations and function declarations get parsed before the functions that use them. However, this would take a large amount of memory for big files, and it would be hard to handle with YACC (I would have to create separate grammars for each type of declaration/body). I would also have to hand-write the lexer (which is not that much of a problem).

I don't care a whole lot about efficiency (although it still is important), because I'm going to rewrite the compiler in itself once I finish it, but I want that version to be fast (so if there are any fast general techniques that can't be done in Lex/YACC but can be done by hand, please suggest them also). So right now, ease of development is the most important factor.

Are there any good solutions to this problem? How is this usually done in compilers for languages like C# or Java?

Was it helpful?

Solution

It's entirely possible to parse it. Although there is an ambiguity between identifiers and keywords, lex will happily cope with that by giving the keywords priority.

I don't see what other problems there are. You don't need to determine if identifiers are valid during the parsing stage. You are constructing either a parse tree or an abstract syntax tree (the difference is subtle, but irrelevant for the purposes of this discussion) as you parse. After that you build your nested symbol table structures by performing a pass over the AST you generated during the parse. Then you do another pass over the AST to check that identifiers used are valid. Follow this with one or more additional parses over the AST to generate the output code, or some other intermediate datastructure and you're done!

EDIT: If you want to see how it's done, check the source code for the Mono C# compiler. This is actually written in C# rather than C or C++, but it does use .NET port of Jay which is very similar to yacc.

OTHER TIPS

One option is to deal with forward references by just scanning and caching tokens till you hit something you know how to real with (sort of like "panic-mode" error recovery). Once you have run thought the full file, go back and try to re parse the bits that didn't parse before.

As to having to hand write the lexer; don't, use lex to generate a normal parser and just read from it via a hand written shim that lets you go back and feed the parser from a cache as well as what lex makes.

As to making several grammars, a little fun with a preprocessor on the yacc file and you should be able to make them all out of the same original source

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top