Get Control flow graph from Abstract Syntax Tree

https://stackoverflow.com/questions/92537

01-07-2019
|

Question

I have an AST derived from the ANTLR Parser Generator for Java. What I want to do is somehow construct a control flow graph of the source code, where each statement or expression is a unique Node. I understand there must be some recursiveness to this identification, I was wondering what you would suggest as the best option and if ANTLR has a toolset I can use for this job. Cheers, Chris

EDIT - My main concern is to get a control flow graph(CFG) from the AST. This way I can get a tree representation of the source. To clarify, both the source code and the implementation language is Java.

Solution

Usually CFGs are computed on a lower-level representation (e.g. JVM bytecode). Someone did a thesis on such things a few years ago. There might be a helpful way described in there for how to get at that representation.

Since your source and target languages are the same, there's no code generation step -- you're already done! However, now you get to walk the AST. At each node of the AST, you have to ask yourself: is this a "jumping" instruction or not? Method calls and if statements are examples of jumping instructions. So are loop constructs (such as for and while). Instructions such as addition and multiplication are non-jumping.

First associate with each java statement a node in the CFG, along with an entry and exit node. As a first approximation, walk the tree and:

if the current statement is a method call, figure out where the entry node is for the corresponding body of that method call, and make an edge pointing from the current statement to that entry node. if the statement is a method return, enumerate the places that could have called it and add an edge to those.
for each non-jumping statement, make an edge between it and the next statement.

This will give you some kind of CFG. The procedure is slightly hairy in step 2 because the method called may be declared in a library, and not elsewhere in the AST -- if so, either don't make an edge or make an edge to a special node representing the entry to that library method.

Does this make sense?

OTHER TIPS

Producing a full control flow graph that really takes into account all the language issues is harder than it looks. Not only do you have to identify what appears to be the "basic blocks", but you have to identify function calls (sort of easy, but identifying the target might be harder), where behind-the-scenes operations such as class initializers can happen. and to worry about the points where exceptions can occur and where control goes if an exception does occur.

If you examine most languages carefully, they will also be clear about ordering of evaluation of computations in expressions, and this matters if you have two side-effects in an expression; the control flow should reflect the order (or the non-order, if it isn't defined).

Maybe you only want an abstraction of the control flow having the basic blocks and the conditionals. That's obviously a bit easier.

In either case (simple CFG or full CFG), you need to walk the AST, at each point having a reference to possible control flow targets (e.g., for most cases, such as IF statements, there are two flow targets: the THEN and ELSE clauses). At each node, link that node to the appropriate control flow target, possibly replacing the flow targets (e.g., when you encounter an IF).

To do this for the full language semantics of Java (or C) is quite a lot of work. You may want to simply use a tool that computes this off-the-shelf. See http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html for what this really looks like, coming out of our tools.

Based on some comments, it sounds like the OP really wants to do code generation -- to convert the AST into a lower-level sequence of instructions based on basic blocks and jump points.

Code generation is very language-specific, and a lot of work has been put into this topic. Before you do code generation you need to know your target language -- whether it be assembler or simply some other high-level language. Once you have identified this, you simply need to walk the AST and generate a sequence of instructions that implements the code in the AST. (I say this is simple, but it can be hard -- it's hard to generalise because the considerations here are pretty language-specific.)

The representation you choose for code generation will contain the control-flow graph, implicitly or explicitly. If your target language is fairly low-level (close to assembler), then the control-flow graph should be relatively easy to extract.

(Please comment if you'd like more clarification.)

Did you ever tryed ANTLR Studio? It does not generate the hole AST graph, but for review, its already quite helpful.

When I have done this in the past, I used graphviz, in particular the dot tool, to generate the graph. I created the dot input file by actually traversing the control-flow graph at compile time.

Graph layout is a hard problem, and graphviz does an excellent job. It can output to ps, pdf, and various image formats, and the layout is usually pretty intuitive to look at. I highly recommend it.

I don't think I'll be able to answer your question in a way that you are maybe looking for since I don't know of any way in ANTLR to produce a CFG with or without an AST. But, in short you would use what ANTLR produces to generate a separate Java program to produce a CFG. You would utilize the ANTLR generated syntax tree as input to generate your CFG in a separate Java program of your own creation. At this point you are, in essence, building a compiler. The difference between your "compiler" and a JVM is that your output is a visual representation (CFG) of how a program branches its various execution paths and a JVM/Java compiler produces code for execution on a real machine (CPU).

An analogy is if someone sits down to write a book (in English for example), the individual words used in sentences are the TOKENS of a computer language, sentences are formed in a similar manner that context free grammars express valid computer code, and paragraphs & whole novels tell a story in a similar manner that semantic analysis/compilers/CFGs might produce/represent logically valid programs that actually do something useful and are more or less free of logic bugs. In other words, once you get past the issue of valid syntax (correct sentence structure), anyone can write a bunch of sentences on a page but only certain combinations of sentences produce text that actually does something (tell a story).

What you're asking about is that last piece - how to go about taking a syntax tree and transforming or interpreting what the AST actually does logically. And of course you would need to build a "compiler" for each language you want to do this for. Having a correct grammar doesn't tell you what a program does - just that a program is correct from a grammar perspective.

Linters and syntax highlighters and IDEs are all built around the idea of trying to make this last piece of the puzzle an easier and a more efficient task for humans.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow