What are the tradeoffs of performing static analysis on source code, byte code, machine code, etc?

StackOverflow https://stackoverflow.com/questions/7901576

Question

What are the various tradeoffs for performing static analysis on various levels of code? For instance for Java, why would someone perform static analysis on Java source code vs. Jasmin code vs. Java bytecode? Does the choice restrict or expand the various types of analyses able to be done? Does the choice influence the correctness of the analyses? Thanks.

Solution

What are the various tradeoffs for performing static analysis on various levels of code? For instance for Java, why would someone perform static analysis on Java source code vs. Java bytecode?

From a user perspective, I'd say: unless you have very specific, easy-to-formalize properties to analyze (such as pure safety properties), go with a tool that supports Java source code.

From a tool-developer perspective, it may be easier to work at one level or another. Here are the differences that come to mind. (Note that with a compiler and/or a decent decompiler, a tool can, for instance, operate on one layer and present the results on another.)

Pros for Java source code:

  • Structured language, i.e. loops etc. instead of arbitrary jumps. (This makes it a lot easier to define a weakest-precondition calculus, for instance; see the sketch after this list.)
  • You can make stronger assumptions about the code (bytecode programs are more expressive, so many bytecode programs have no Java source counterpart).
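
To make the first point concrete, here is a contrived example of my own (not from the original answer): in source, the loop is one structured statement with an obvious invariant, while javap -c shows the same method as a conditional branch plus a goto, so a bytecode-level analysis must first rediscover the loop before a loop rule can be applied.

    class LoopShape {
        // In source, the for-loop is a single structured statement, so a
        // weakest-precondition / loop-invariant rule applies directly:
        //   invariant (for n >= 0): sum == 0 + 1 + ... + (i - 1)
        static int sumUpTo(int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) {
                sum += i;
            }
            return sum;
        }
        // After compilation ("javap -c LoopShape"), the same method is a flat
        // instruction list built around if_icmpge and goto: the loop structure
        // has to be reconstructed from the jump targets before any loop rule
        // can be applied.
    }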

Pros for Bytecode:

  • The language specification (the semantics of the bytecode instructions) is a lot simpler.
  • A more "pinned down" specification of the machine (the VM).
  • You can extend the analysis to legacy code and libraries.
  • The analysis also covers other languages targeting the JVM (Clojure, Scala, JRuby, ...).
  • No need for a possibly complex parser (see the sketch after this list).
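
As a rough illustration of the last three points, here is a minimal sketch of my own (assuming the ASM bytecode library, org.ow2.asm:asm, is on the classpath): it needs no Java parser and runs unchanged on library classes and on classes produced by Scala, Clojure, JRuby, etc.

    import org.objectweb.asm.ClassReader;
    import org.objectweb.asm.ClassVisitor;
    import org.objectweb.asm.MethodVisitor;
    import org.objectweb.asm.Opcodes;

    // Tiny bytecode-level "analysis": report which methods of a class use
    // locking, either via the ACC_SYNCHRONIZED flag or an explicit
    // monitorenter instruction (a synchronized block).
    public class MonitorScan {
        public static void main(String[] args) throws Exception {
            ClassReader reader = new ClassReader(args[0]); // e.g. "java.util.Vector"
            reader.accept(new ClassVisitor(Opcodes.ASM9) {
                @Override
                public MethodVisitor visitMethod(int access, String name, String desc,
                                                 String signature, String[] exceptions) {
                    if ((access & Opcodes.ACC_SYNCHRONIZED) != 0) {
                        System.out.println(name + desc + " is declared synchronized");
                    }
                    return new MethodVisitor(Opcodes.ASM9) {
                        @Override
                        public void visitInsn(int opcode) {
                            if (opcode == Opcodes.MONITORENTER) {
                                System.out.println(name + desc + " enters a monitor explicitly");
                            }
                        }
                    };
                }
            }, 0);
        }
    }

Running it as "java MonitorScan java.util.Vector" lists the synchronized methods of a class you have no source for; the same class-file walk works for a class compiled from any JVM language.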

Pros for machine code:

  • You verify what you actually feed the CPU with. (No need to use a verified compiler or verified VM if you want a fully verified chain.)

State-of-the-art tools such as Spec# (a formal-methods dialect of C#) usually go through an intermediate language specifically designed for formal analysis (BoogiePL, which is neither MSIL nor C#, in the Spec# case).

Does the choice restrict or expand the various types of analyses able to be done?

In the end... no, not really. You face the same fundamental problems regardless of which (Turing complete) language you choose to analyze. Depending on what properties you analyze, YMMV though.

If you're into formal methods and thinking about implementing an analysis yourself, I suspect you'll find better tool-support for bytecode. If you're a user or developer and want to perform analysis on your own code-base, I suspect you'll benefit more from tools operating on Java-source code level.

Does the choice influence the correctness of the analyses?

Depends on what you mean by correctness. A static analysis is most often "defensive" in the sense that you don't assume anything that you don't know is true. If you restrict your attention to sound verification systems, all of them will be "equally correct".

OTHER TIPS

IntelliJ has static analysis for comments (e.g. Javadoc) and parameter names, which are not available in the bytecode; it catches things like spelling mistakes and name inconsistencies. Analysing the source also gives you the line number and the position within the line of any issue.

The benefit of analysing bytecode is that it's much simpler and may be all you need. You might have line numbers, but you won't have the position within the line. And you can analyse compiled code for which you don't have the source, e.g. libraries.
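
As a contrived example of my own (not from the answer) of what only a source-level tool can see: the Javadoc typo and the stale parameter name below never reach the class file, and even the real parameter name survives compilation only with -parameters or full debug info; the class file keeps line numbers (via the LineNumberTable, when compiled with debug info) but no column positions.

    class Mailbox {
        /**
         * Delivers a message to this mailbox.
         *
         * @param msg the message that was recieved   <-- typo and name drift are
         *            visible only to a source-level inspection (e.g. IntelliJ's);
         *            the comment does not exist in the bytecode.
         */
        void deliver(String message) {
            // The compiled method retains the line number of this statement,
            // but not its column within the line.
            System.out.println("delivered: " + message);
        }
    }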

What are the various tradeoffs for performing static analysis on various levels of code? For instance for Java, why would someone perform static analysis on Java source code vs. Jasmin code vs. Java bytecode?

Think of it this way. If you get negative results (results indicating or suggesting a negative or detrimental attribute) from Jasmin or bytecode, what would you do about it? How would you go about addressing that in a manner that is timely and cost effective?

Now consider the scenario where static analysis of the source code (most likely your own source code, or code that you own) reports a negative/detrimental attribute that needs addressing.

Do you think you will have a harder time addressing a detrimental aspect mapped to source code than addressing a (possibly similar or related) one mapped to bytecode or Jasmin?

The thing is that 1) Jasmin is expected to be a one-to-one representation of legitimate bytecode, and 2) that bytecode has been generated by a bona fide compiler. The chances that a problem in the bytecode maps directly to a problem introduced in the source code are, in the presence of a well-behaved compiler, very small.

Independently of whether problems detected at the bytecode level are a result of problems introduced at the source code level or the result of a faulty compiler/environment, these problems are typically not actionable. You typically cannot act upon them, at least not directly.

Problems detected at the source code level, OTOH, are efficiently actionable. That is, you can get your hands on them and fix them (and, by inference, remove any problems in the bytecode derived from them).

There are things that can be detected at the bytecode level, in particular in the context of packaging (i.e. packaging unnecessary libraries). But you hardly ever need to do verification at the bytecode level.

Unless you are in the business of compiler and language design (in this case, targeting the JVM), for efficiency and practicality purposes, you 1) assume the compiler is correct, and 2) given the way the JVM is specified, assume the compiler performs verification at compile time and the JVM performs verification at run time.

Does the choice restrict or expand the various types of analyses able to be done? Does the choice influence the correctness of the analyses? Thanks.

How do you define correctness? What is correctness in this context? And how could the choice affect correctness? Are we talking correctness at the type-system level? Partial and/or total correctness? Correctness with respect to attributes such as fairness and liveness? Correctness of the analysis process itself? Correctness with respect to meeting one or more requirements?

Define your terms dude :)

Regardless, you have to assume the compiler is doing a sufficiently correct translation of your code into the target instruction set (again, unless you are in the business of compiler/language design.)

If you work on the assumption that the "native" representation of your code is correct (that is, your source "maps" to it according to the desired target platform and type system), then you narrow your field of verification down to your source code, for the attributes you want to verify.

Another consideration is that abstraction loses high-level information. We perform the analysis on source code (the high-level form) because we need to know where each expression occurs in the source.

A source-to-binary mapping is very important in the area of source code visualization.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow