Question

I've spent a great deal of time reading the LLVM source tree. It is quite an impressive piece of engineering!

Anyhow, I have been trying to convert some Mach-O ARM binaries that I have into LLVM bitcode for basic static analysis. Mainly, I'd like to create backward static slices on certain calls depending on which registers are used. Additionally, I am trying to do forward propagation of obvious constants (for instance, loading a function name from the symbol table and passing it in a register).

At this point, I have been able to dump a file and disassemble it into native ARM assembly using this command line:

    bash-3.2$ llvm-objdump -d ~/code/osx/HelloWorldThin -triple=thumb \
        -mattr=+thumb2,+32bit,+v7,+v6t2,+thumb-mode,+neon

    /Users/steve/code/osx/HelloWorldThin:   file format Mach-O arm

    Disassembly of section __TEXT,__text:
    _main:
        2fd4:       f0 b5            push    {r4, r5, r6, r7, lr}
        2fd6:       03 af            add     r7, sp, #12
        2fd8:       4d f8 04 8d      str     r8, [sp, #-4]!
        2fdc:       0d 46            mov     r5, r1
        2fde:       06 46            mov     r6, r0
        2fe0:       00 f0 fe ef      blx     #4092

...snipped...

This is great, as it saves me a bunch of time writing a parser!

After looking through MachODump.cpp, I see that these are lowered to MCInst, which, as I understand it, is just a parsed opcode with its operands.

So my questions are:

1) Is there a way to convert from ARM to LLVM (for optimization passes, etc)? There is no need to emit back to ARM, only a need to have an analysis result.

1.5) I notice that all the analysis passes operate on Instruction rather than MCInst. Is there a way to promote an MCInst to an Instruction and provide the required information?

2) Is there a way to emulate/simulate ARM or LLVM instructions? I ask because things like slicing and constant propagation need dataflow analysis in order to determine what contents are in memory and registers.

Operations like this require tracking how data is loaded from and stored to memory, along with the registers involved. Can LLVM understand the side effects of these instructions for analysis?

    __text:000032DE                 LDR             R1, [R0] ; "viewDidLoad"
    __text:000032E0                 MOV             R0, SP
    __text:000032E2                 BLX             _objc_msgSendSuper2

3) If it seems like I have a fundamental misunderstanding of something going on in LLVM, I'd love any feedback.

Thanks and let me know if I can provide any more information about my problem.


Solution

For static analysis of ARM binaries, it is better to translate the semantics of each ARM instruction directly to LLVM IR and apply data-flow analysis on the latter. For example, an ADD rd, rd, rm in ARM can be translated to the LLVM IR instruction %rd2 = add i32 %rd1, %rm1, where each write to a register creates a new SSA value.
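To make the idea concrete, here is a minimal sketch in Python of such a translation for a handful of register-to-register instructions. Everything in it is hypothetical (the Lifter class, the backward_slice helper, and the %reg.N naming with a dot before the version number); it ignores condition flags, memory accesses, and calls, and it is not how Fracture or LLBT work. It only shows that once each register write gets its own SSA name, a backward slice reduces to walking def-use edges.

```python
# Hypothetical toy lifter: maps a tiny subset of ARM data-processing
# instructions to textual LLVM IR. Flags, conditional execution, memory
# and calls are deliberately ignored.
OPCODES = {"ADD": "add", "SUB": "sub", "AND": "and", "ORR": "or", "EOR": "xor"}


class Lifter:
    def __init__(self):
        self.version = {}  # register name -> current SSA version number
        self.ir = []       # emitted LLVM IR lines, in program order
        self.defs = {}     # SSA value -> SSA values read by its definition

    def use(self, operand):
        """Return the LLVM operand for a register or a '#imm' immediate."""
        if operand.startswith("#"):
            return operand[1:]
        # A register read before any write refers to its incoming value (.1).
        self.version.setdefault(operand, 1)
        return "%{}.{}".format(operand, self.version[operand])

    def define(self, reg):
        """Create a fresh SSA name for a register that is being written."""
        self.version[reg] = self.version.get(reg, 0) + 1
        return "%{}.{}".format(reg, self.version[reg])

    def lift(self, line):
        """Translate one assembly line and return the SSA value it defines."""
        mnemonic, rest = line.strip().split(None, 1)
        mnemonic = mnemonic.upper()
        ops = [o.strip().lower() for o in rest.split(",")]
        if mnemonic == "MOV":
            rd, src = ops
            lhs, rhs, opcode = self.use(src), "0", "add"  # copy as 'add x, 0'
        else:
            rd, rn, rm = ops
            # Sources are read before rd is redefined (matters for ADD rd, rd, rm).
            lhs, rhs, opcode = self.use(rn), self.use(rm), OPCODES[mnemonic]
        dest = self.define(rd)
        self.ir.append("{} = {} i32 {}, {}".format(dest, opcode, lhs, rhs))
        self.defs[dest] = [v for v in (lhs, rhs) if v.startswith("%")]
        return dest


def backward_slice(lifter, value):
    """Collect every SSA value that `value` transitively depends on."""
    seen, work = set(), [value]
    while work:
        v = work.pop()
        if v not in seen:
            seen.add(v)
            work.extend(lifter.defs.get(v, []))
    return seen


if __name__ == "__main__":
    lifter = Lifter()
    for asm in ["MOV r5, r1", "ADD r0, r5, #4", "ADD r0, r0, r2"]:
        lifter.lift(asm)
    print("\n".join(lifter.ir))
    print(sorted(backward_slice(lifter, "%r0.2")))
```

A real lifter would also have to model the status flags, loads and stores, and control flow (basic blocks and phi nodes), which is where most of the work lies.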

Decompiling ARM machine code to C (in order to compile it back to LLVM IR) is both cumbersome and unnecessary. Note that decompilers like the one in IDA Pro focus on binary understanding, not on recompilation per se. Therefore, you would have a hard time recompiling the software, and an even harder time linking your analysis results back to the original binary.

The following links might be useful:

  • Fracture is an open-source project attempting to directly translate ARM binaries to LLVM IR.
  • LLBT is a research project that implemented ARM translation to LLVM IR. Its goal, however, is static binary rewriting rather than binary analysis.

Note that you need a robust disassembler if you are considering analyzing stripped binaries. objdump can produce too many disassembly errors on binaries without symbols.

I'm in the early phases of a research project where we develop a processor description language that can make describing instruction semantics in LLVM IR easier. I'll update this answer when we have more results.

OTHER TIPS

For (1) - not within the framework of LLVM. There's no "decompiler" in there. You're free to use an external decompiler that translates machine code into C, and then compile that into LLVM IR with clang. YMMV with regard to the quality of such a translation, of course.

(1.5) If I understand what you're asking, then no. Instruction and MCInst are quite different animals, very far apart in their abstraction levels. Read this: http://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm/

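(2) Yes, LLVM has an interpreter that you can use through the lli tool. It directly "emulates" LLVM IR without lowering it to machine code. Note that lli normally JIT-compiles when a JIT is available for the host; the -force-interpreter option selects the interpreter. It executes LLVM IR only, not ARM instructions.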

Licensed under: CC-BY-SA with attribution