Question

I'm working on a tool that will perform some simple transformations on programs (like extract method). To do this, I will have to perform the first few steps of compilation (tokenizing, parsing and possibly building a symbol table). I'm going to start with C and then hopefully extend this out to support multiple languages.

My question is, what's the best way to perform these steps that:

1.) Doesn't reinvent the wheel. Clearly I don't want to write Flex/Bison specifications by hand. Do I just grab pre-existing specifications and work from there? Is ANTLR the way to go here?

2.) Is extendable to multiple languages. Obviously the lexing/parsing will be different for every language, but I would like a solution that I could easily extend to other languages, or at least a set of technologies that would make this manageable.

BTW, I'm using C to write my application.

If anyone has any ideas, that would be great! Thanks!

Solution

Hands down, the best way to do any parsing is ANTLR. There are two great books on the subject by its author that are must-haves: The Definitive ANTLR Reference: Building Domain-Specific Languages and Language Implementation Patterns. Both are invaluable resources. ANTLR can generate processing code in lots of different target languages.

OTHER TIPS

Since you are going to use already-written grammars and regular expressions, your choice of tool hardly matters.

You can go with flex/bison, and you will find many grammars already written for them. Otherwise you can go with ANTLR, which should work for C, C++, and Java without problems, and do the same thing there.
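
To give a concrete sense of the flex route, here is a minimal, illustrative lexer specification; the token set and actions are invented for this sketch and not taken from any existing C grammar. flex compiles it into a C scanner, which matches the questioner's implementation language:

    %{
    #include <stdio.h>
    %}
    %option noyywrap

    DIGIT  [0-9]
    ID     [a-zA-Z_][a-zA-Z0-9_]*

    %%
    {DIGIT}+            { printf("INT(%s)\n", yytext); }
    {ID}                { printf("IDENT(%s)\n", yytext); }
    "=="|"!="|"<="|">=" { printf("OP(%s)\n", yytext); }
    [-+*/=<>;,(){}]     { printf("PUNCT(%c)\n", yytext[0]); }
    [ \t\r\n]+          { /* skip whitespace */ }
    .                   { fprintf(stderr, "unexpected: %s\n", yytext); }
    %%

    int main(void) { return yylex(); }

Build it with something like flex tokens.l && cc lex.yy.c -o tokens (the file name is hypothetical); a real front end would return token codes to the parser instead of printing them.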

You didn't say which language you are going to use for this work, so it's not easy to suggest the best approach.

Keep in mind that every language has its own features; for example, symbol tables are constructed differently in Ruby than in C++, because declarations can be stricter or looser, and so on. So think carefully about what you are going to need (and you can explain it in your question too, so I can give better help).

Of your two phases, I can say that:

  • Tokenizing is quite simple: it doesn't require different structures for every language and can easily be extended to support a plethora of programming languages.

  • Parsing can be more difficult. You have to build up an Abstract Syntax Tree (AST) of the program and then do whatever you want with it. If you do it OOP-style you'll use a class for every node type, but node types differ between languages because languages are structurally different, so doing something general that is easily extendable to other languages is quite tricky (see the sketch just below for what this looks like in C).
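
Since the question mentions C as the implementation language, and C has no classes, the "class for every node type" idea usually becomes a tagged union. A minimal sketch, with all type and field names invented for the example:

    #include <stdlib.h>

    /* One tag per node type; a real grammar would have many more. */
    enum AstKind { AST_NUM, AST_IDENT, AST_BINOP };

    struct Ast {
        enum AstKind kind;
        union {
            double      num;      /* AST_NUM   */
            const char *ident;    /* AST_IDENT */
            struct {              /* AST_BINOP */
                char        op;   /* '+', '-', '*', '/' */
                struct Ast *lhs, *rhs;
            } binop;
        } u;
    };

    /* Constructor for one node type; the others follow the same pattern. */
    struct Ast *ast_binop(char op, struct Ast *lhs, struct Ast *rhs) {
        struct Ast *n = malloc(sizeof *n);
        n->kind = AST_BINOP;
        n->u.binop.op  = op;
        n->u.binop.lhs = lhs;
        n->u.binop.rhs = rhs;
        return n;
    }

Extending to another language then means adding or swapping out node kinds, which is exactly where the per-language differences show up.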

On this point ANTLR wins over flex and bison, because it offers automatic AST generation (if I remember correctly).

The main difference between these two compiler-compilers is that ANTLR uses an LL(k) parser (which is top-down) while Bison uses an LALR(1) parser (which is bottom-up); but if you use already-written grammars, that difference shouldn't cause you much trouble. A hand-written illustration of the top-down style follows.
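
To make "top-down" concrete, here is a tiny hand-written recursive-descent parser in C, the style of parser that LL tools generate. It is a self-contained sketch for arithmetic only, and it evaluates as it parses instead of building an AST; everything in it is illustrative:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Grammar (LL(1)):
     *   expr   -> term   (('+' | '-') term)*
     *   term   -> factor (('*' | '/') factor)*
     *   factor -> NUMBER | '(' expr ')'
     */

    static const char *src;   /* cursor into the input string */

    static void skip_ws(void) { while (isspace((unsigned char)*src)) src++; }

    static double expr(void);

    static double factor(void) {
        skip_ws();
        if (*src == '(') {
            src++;                      /* consume '(' */
            double v = expr();
            skip_ws();
            if (*src != ')') { fprintf(stderr, "expected ')'\n"); exit(1); }
            src++;                      /* consume ')' */
            return v;
        }
        char *end;
        double v = strtod(src, &end);
        if (end == src) { fprintf(stderr, "expected a number\n"); exit(1); }
        src = end;
        return v;
    }

    static double term(void) {
        double v = factor();
        for (;;) {
            skip_ws();
            if      (*src == '*') { src++; v *= factor(); }
            else if (*src == '/') { src++; v /= factor(); }
            else return v;
        }
    }

    static double expr(void) {
        double v = term();
        for (;;) {
            skip_ws();
            if      (*src == '+') { src++; v += term(); }
            else if (*src == '-') { src++; v -= term(); }
            else return v;
        }
    }

    int main(void) {
        src = "2 + 3 * (4 - 1)";
        printf("%g\n", expr());   /* prints 11 */
        return 0;
    }

Each grammar rule becomes one function, and the call stack mirrors the parse tree, which is what "top-down" means in practice.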

A piece of personal advice: I have written many interpreters and compilers, but I never started from a fully featured language. C's syntax is really big, so maybe you should start from a subset, then see what you can do with tokens and an AST, and later extend it to support the full syntax.

What language are you writing your program in?

I'd go with ANTLR (and I actually use it for parsing Java). It supports a lot of target languages and also has a lot of example grammars that you get for free: http://www.antlr.org/grammar/list. Unfortunately they aren't necessarily perfect (the Java grammar has no AST rules), but they give you a good start, and I suppose the community is quite big for a parser generator.

The great thing about ANTLR, apart from the many language targets, is that LL(*) combined with the predicates ANTLR supports is very powerful and easy to understand, and the generated parsers are too.

With "extendable to multiple languages" I suppose you mean multiple source languages. This isn't easy but I suppose you might have some success when translating them to ASTs that have as much common symbols as possible and writing a general tree walker that can handle the differences in those languages. But this could be quite difficult.

Be warned, though, that the online documentation is only good once you've read the official ANTLR book and understand LL(*) and semantic and syntactic predicates.

You didn't specify a language, so I'll just recommend this little gem I found the other day:

http://irony.codeplex.com/

It's super simple to use, and it even has pre-built grammars for several languages (C# among them). There's also pyparsing (http://pyparsing.wikispaces.com/) if you want to use Python as your implementation language.

Another door to go through is Eclipse. It has parsing, including error-tolerant parsing, for a variety of languages, and its internal modularity allows you to exploit this functionality without touching the IDE itself.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow