For static analysis of ARM binaries, it is better to translate the semantics of each ARM instruction directly to LLVM IR and apply data-flow analysis to the latter. For example, an `ADD rd, rd, rm` in ARM can be translated to the LLVM IR `%rd2 = add i32 %rd1, %rm1`.
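
To make the idea concrete, here is a minimal sketch of such a translation in Python. It is not how Fracture, LLBT, or any real lifter works; the `Lifter` class, the `ARM_TO_LLVM` table, and the register-versioning scheme are all assumptions made for illustration. It handles only three-register data-processing instructions and ignores condition codes, flag updates, shifted operands, and immediates.

```python
import re
from collections import defaultdict

# Hypothetical table: a few ARM data-processing mnemonics and the
# LLVM IR opcode with the same arithmetic meaning (flags are ignored).
ARM_TO_LLVM = {"ADD": "add", "SUB": "sub", "AND": "and", "ORR": "or", "EOR": "xor"}

# Matches "MNEMONIC rd, rn, rm" with register operands only.
INSN_RE = re.compile(r"\s*(\w+)\s+(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*")


class Lifter:
    """Translates simple three-register ARM instructions to LLVM IR text."""

    def __init__(self):
        # Current SSA version of each register; 0 means "not seen yet".
        self.version = defaultdict(int)

    def use(self, reg):
        # A register read before any write is treated as an input, version 1.
        if self.version[reg] == 0:
            self.version[reg] = 1
        return f"%{reg}{self.version[reg]}"

    def define(self, reg):
        # Every write creates a fresh SSA name by bumping the version.
        # Note: names like r1 and r11 would need a separator in a real tool.
        self.version[reg] += 1
        return f"%{reg}{self.version[reg]}"

    def lift(self, insn):
        m = INSN_RE.fullmatch(insn)
        if m is None or m.group(1).upper() not in ARM_TO_LLVM:
            raise ValueError(f"unsupported instruction: {insn!r}")
        mnemonic, rd, rn, rm = (g.lower() for g in m.groups())
        # Read the sources before defining the destination, so that
        # "ADD rd, rd, rm" uses the old value of rd.
        lhs, rhs = self.use(rn), self.use(rm)
        dst = self.define(rd)
        return f"{dst} = {ARM_TO_LLVM[mnemonic.upper()]} i32 {lhs}, {rhs}"


lifter = Lifter()
print(lifter.lift("ADD rd, rd, rm"))  # prints: %rd2 = add i32 %rd1, %rm1
```

Once every instruction is in this SSA form, def-use chains fall out of the value names, which is what makes data-flow analysis on the IR simpler than on the raw disassembly.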
Decompiling ARM machine code to C (in order to recompile it to LLVM IR) is both cumbersome and unnecessary. Note that decompilers like IDA Pro focus on binary understanding, not on recompilation per se. Therefore, you would have a hard time recompiling the software, and an even harder time linking your analysis results back to the original binary.
The following links might be useful:
- Fracture is an open source project attempting to directly translate ARM binaries to LLVM IR.
- LLBT is a research project that implemented ARM-to-LLVM IR translation. Its goal, however, is static binary rewriting rather than binary analysis.
Note that you need a robust disassembler if you are considering analyzing stripped binaries. `objdump` can emit too many disassembly errors on binaries without symbols.
I'm in the early phases of a research project in which we are developing a processor description language that makes describing instruction semantics in LLVM IR easier. I'll update this answer when we have more results.