How to build Abstract Syntax Trees from grammar specification in Haskell?

Question 1

I've never used bnfc-meta (suggested by @phg), but I would strongly recommend you look into BNFC (on hackage: http://hackage.haskell.org/package/BNFC). The basic approach is that you write your grammar in an annotated BNF style, and it will automatically generate an AST, parser, and pretty-printer for the grammar.

How suitable BNFC is depends upon the complexity of your grammar. If it's not context-free, you'll likely have a difficult time making any progress (I did have some success hacking up context-sensitive extensions, but that code has likely bit-rotted by now). The other downside is that your AST will very directly reflect the grammar specification. But since you already have a BNF specification, adding the necessary annotations for BNFC should be rather straightforward, so it's probably the fastest way to get a usable AST. Even if you decide to go a different route, you might be able to take the generated data types as a starting point for a hand-written version.
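To give a feel for how directly the AST mirrors the grammar, here is a rough sketch. The LBNF rules in the comment are a made-up toy grammar, and the data type is written by hand to approximate what BNFC derives from it (the tool's real output may differ in details such as deriving clauses); `eval` is a hypothetical consumer added only for illustration.

```haskell
-- Hand-written approximation of the AST BNFC derives from these LBNF rules:
--
--   EAdd. Exp  ::= Exp "+" Exp1 ;
--   EMul. Exp1 ::= Exp1 "*" Exp2 ;
--   EInt. Exp2 ::= Integer ;
--   coercions Exp 2 ;
--
-- Each rule label becomes a constructor; the precedence levels Exp1 and Exp2
-- collapse into the single type Exp.
data Exp
  = EAdd Exp Exp
  | EMul Exp Exp
  | EInt Integer
  deriving (Eq, Show)

-- Hypothetical consumer of the AST: a tiny evaluator.
eval :: Exp -> Integer
eval (EAdd a b) = eval a + eval b
eval (EMul a b) = eval a * eval b
eval (EInt n)   = n
```

One constructor per grammar rule is convenient to get started, but it is also why a grammar with many helper rules yields a noisy AST.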

Question 2

In Haskell there are two main ways of parsing something: parser combinators or a parser generator. Since you already have a BNF grammar, I'd suggest the latter.

A good one is Happy, usually paired with the lexer generator Alex. GHC's own parser is written with this pair, so you'd be in good company.
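For comparison, this is roughly what the parser-combinator route looks like. It is only a sketch: it uses `Text.ParserCombinators.ReadP` from base so that it runs without extra packages (parsec or megaparsec would be the more usual choice), and the `Expr` type and the toy "sums of integers" language are assumptions made up for the example.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP
  (ReadP, chainl1, char, eof, munch1, readP_to_S, skipSpaces)

-- Hypothetical toy AST for sums of integer literals.
data Expr
  = Lit Int
  | Add Expr Expr
  deriving (Eq, Show)

-- Run a parser, then skip any trailing whitespace.
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

-- One or more digits, e.g. "42".
literal :: ReadP Expr
literal = lexeme (Lit . read <$> munch1 isDigit)

-- Literals separated by '+', associated to the left.
expr :: ReadP Expr
expr = literal `chainl1` (Add <$ lexeme (char '+'))

-- Succeeds only if the whole input is consumed.
parseExpr :: String -> Maybe Expr
parseExpr s =
  case readP_to_S (skipSpaces *> expr <* eof) s of
    [(e, "")] -> Just e
    _         -> Nothing
```

Here the grammar is encoded directly as Haskell functions, which is flexible but means an existing BNF has to be translated by hand, including the removal of left recursion (done above by `chainl1`).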

Next you'll have a big honking stack of data declarations to parse into:

data JavaClass = JavaClass {
    className :: Name,
    interfaces :: [Name],
    contents :: [ClassContents],
    ...
 }

data ClassContents = M Method
                   | F Field
                   | IC InnerClass

and for expressions and whatever else you need. Finally you'll combine these into something like

data TopLevel = JC JavaClass
              | WhateverOtherForms
              | YouWillParse

Once you have this, the entire AST is represented as one TopLevel, or a list of them, depending on how many classes/files you parse.

How to proceed from here depends on what you want to do. There are a number of libraries, such as syb (Scrap Your Boilerplate), that let you write very concise tree traversals and modifications. lens is also an option. At a minimum, check out Data.Traversable and Data.Foldable.
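As a small illustration of what Data.Foldable and Data.Traversable give you, here is a sketch on a made-up expression type. The type `Expr`, its constructors, and the helpers `names` and `resolve` are all hypothetical; the trick is to parameterise the AST over the part you want to traverse (here, variable names) so that GHC can derive the instances.

```haskell
{-# LANGUAGE DeriveTraversable #-}

-- Hypothetical expression AST, parameterised over the type of names so that
-- Functor, Foldable and Traversable can be derived.
data Expr v
  = Var v
  | IntLit Int
  | Call v [Expr v]
  | BinOp Char (Expr v) (Expr v)
  deriving (Show, Eq, Functor, Foldable, Traversable)

-- Foldable: collect every name in the tree, left to right.
names :: Expr v -> [v]
names = foldr (:) []

-- Traversable: replace every name by its index in an environment,
-- failing as a whole if any lookup fails.
resolve :: [(String, Int)] -> Expr String -> Maybe (Expr Int)
resolve env = traverse (`lookup` env)
```

For example, `names (BinOp '+' (Var "x") (Call "f" [Var "y"]))` visits the names in the order x, f, y, and `resolve` returns `Nothing` as soon as one of them is missing from the environment.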

To modify the tree, you can do something as simple as

ignoreInnerClasses :: JavaClass -> JavaClass
ignoreInnerClasses c = c{contents = filter (not . isInnerClass) $ contents c}
 -- c{field = value} is called a record update
    where isInnerClass (IC _) = True
          isInnerClass _      = False

and then you could potentially use something like syb to write

 everywhere (mkT ignoreInnerClasses) toplevel

which will traverse everything and apply ignoreInnerClasses to every JavaClass. This is possible to do in lens and many other libraries too, but syb is very easy to read.
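Here is a self-contained sketch of that traversal. To keep it runnable with base alone, `mkT` and `everywhere` are re-defined locally on top of `Data.Data`; they are meant to behave like the syb functions of the same name, and in real code you would import them from `Data.Generics` instead. The cut-down `JavaClass` and `ClassContents` types and the name `stripAll` are assumptions for the example.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE RankNTypes #-}

import Data.Data (Data, gmapT)
import Data.Maybe (fromMaybe)
import Data.Typeable (Typeable, cast)

type Name = String

-- Cut-down, hypothetical versions of the AST types.
data JavaClass = JavaClass
  { className :: Name
  , contents  :: [ClassContents]
  } deriving (Show, Eq, Data)

data ClassContents
  = F Name        -- field
  | IC JavaClass  -- inner class
  deriving (Show, Eq, Data)

-- Local stand-in for syb's mkT: lift a type-specific function to all types,
-- acting as the identity wherever the types do not match.
mkT :: (Typeable a, Typeable b) => (b -> b) -> a -> a
mkT f = fromMaybe id (cast f)

-- Local stand-in for syb's everywhere: apply a transformation to every
-- node of a term, bottom-up.
everywhere :: Data a => (forall b. Data b => b -> b) -> a -> a
everywhere f = f . gmapT (everywhere f)

-- Drop the inner classes of a single class.
ignoreInnerClasses :: JavaClass -> JavaClass
ignoreInnerClasses c = c { contents = filter (not . isInnerClass) (contents c) }
  where
    isInnerClass (IC _) = True
    isInnerClass _      = False

-- Apply ignoreInnerClasses to every JavaClass found anywhere in the input.
stripAll :: Data a => a -> a
stripAll = everywhere (mkT ignoreInnerClasses)
```

Because `stripAll` works on any type with a `Data` instance, the same one-liner can be applied to a single class, a list of classes, or a whole top-level structure.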

Question 3

Alex + Happy.

There are many approaches to modify/investigate the parsed terms (ASTs). The keyword to search for is "datatype-generic" programming. But beware: it is a complex topic ...

http://people.cs.uu.nl/andres/Rec/MutualRec.pdf

http://www.cs.uu.nl/wiki/GenericProgramming/Multirec

A generic implementation of the zipper, built on multirec, is available here:

http://hackage.haskell.org/packages/archive/zipper/0.3/doc/html/Generics-MultiRec-Zipper.html
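If zippers are new to you, the idea is easier to see on a hand-written one first. The sketch below is not the API of the package linked above; it is a minimal zipper for one made-up `Expr` type, and every name in it is hypothetical. The generic version provides the same kind of navigation for a whole family of mutually recursive types.

```haskell
-- Hypothetical binary expression tree.
data Expr
  = Lit Int
  | Add Expr Expr
  deriving (Eq, Show)

-- One step of the path back to the root: which side we went down,
-- plus the sibling that was left behind.
data Crumb
  = WentLeft Expr   -- holds the right sibling
  | WentRight Expr  -- holds the left sibling
  deriving (Eq, Show)

-- A zipper is the focused subtree plus the path back to the root.
data Zipper = Zipper Expr [Crumb]
  deriving (Eq, Show)

fromExpr :: Expr -> Zipper
fromExpr e = Zipper e []

-- Move the focus; Nothing if there is nowhere to go.
left, right, up :: Zipper -> Maybe Zipper
left  (Zipper (Add l r) cs) = Just (Zipper l (WentLeft r : cs))
left  _                     = Nothing
right (Zipper (Add l r) cs) = Just (Zipper r (WentRight l : cs))
right _                     = Nothing
up (Zipper e (WentLeft r  : cs)) = Just (Zipper (Add e r) cs)
up (Zipper e (WentRight l : cs)) = Just (Zipper (Add l e) cs)
up (Zipper _ [])                 = Nothing

-- Change the focused subtree.
modify :: (Expr -> Expr) -> Zipper -> Zipper
modify f (Zipper e cs) = Zipper (f e) cs

-- Walk back to the root and rebuild the whole tree.
toExpr :: Zipper -> Expr
toExpr z@(Zipper e _) = maybe e toExpr (up z)
```

A local edit then reads as "navigate, modify, rebuild", for example `toExpr . modify (const (Lit 9)) <$> right (fromExpr tree)`.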

Also check out Astview: https://github.com/pascalh/Astview

Question 4

You might also check out the Haskell Compiler Series, which is a nice introduction to using Alex and Happy to parse a subset of Java: http://bjbell.wordpress.com/haskell-compiler-series/.

Question 5

Since your grammar can be expressed in BNF, it is context-free. Not every context-free grammar is LALR, but the grammars of most programming languages can be written in LALR(1) form, possibly after some refactoring, and LALR(1) grammars are efficiently parseable with a shift-reduce parser. Such parsers can be generated by the parser generators yacc/bison (C, C++) or their Haskell equivalent, Happy.

That's why I would use Happy in your case. It takes grammar rules in BNF form and generates a parser from them directly. The resulting parser accepts the language described by your grammar rules and produces an AST (abstract syntax tree), which is built by the semantic actions you attach to each rule. The Happy user guide is quite nice and gets you started quickly: http://www.haskell.org/happy/doc/html/

To transform the resulting AST, generic programming is a good idea. Here is a classic explanation of how to do this in Haskell in a practical fashion, from scratch: http://research.microsoft.com/en-us/um/people/simonpj/papers/hmap/
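Besides transformations, the same style of generic programming covers queries over the AST. The sketch below collects all variables read anywhere in a program without any hand-written recursion over the statement type. The `Stmt` and `Expr` types and `usedVars` are made up for the example, and `mkQ` and `everything` are re-defined locally on top of `Data.Data` so that the code runs with base alone; they are meant to behave like the syb functions of the same name.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE RankNTypes #-}

import Data.Data (Data, gmapQ)
import Data.Typeable (Typeable, cast)

-- Hypothetical statement/expression AST.
data Stmt
  = Assign String Expr
  | Block [Stmt]
  deriving (Show, Data)

data Expr
  = Var String
  | Num Int
  | Plus Expr Expr
  deriving (Show, Data)

-- Local stand-in for syb's mkQ: run a type-specific query where the types
-- match, and return a default value everywhere else.
mkQ :: (Typeable a, Typeable b) => r -> (b -> r) -> a -> r
mkQ def q x = maybe def q (cast x)

-- Local stand-in for syb's everything: combine the query result of a node
-- with the results of all of its descendants.
everything :: Data a => (r -> r -> r) -> (forall b. Data b => b -> r) -> a -> r
everything k q x = foldl k (q x) (gmapQ (everything k q) x)

-- All variables that occur inside expressions, in left-to-right order.
usedVars :: Data a => a -> [String]
usedVars = everything (++) (mkQ [] var)
  where
    var (Var v) = [v]
    var _       = []
```

Only the one interesting case (`Var`) is spelled out; adding new statement or expression forms to the AST later does not require touching `usedVars`.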

I have used exactly this to build a compiler for a small domain specific language, and it was a simple and concise solution.