Question

With bytecode-based virtual machine languages like Java, VB.NET, C#, ActionScript 3.0, etc., you sometimes hear about how easy it is to just download some decompiler off the Internet, run the bytecode through it one good time, and oftentimes come up with something not too far from the original source code in a matter of seconds. Supposedly this sort of language is particularly vulnerable to that.

I've recently started wondering why you don't hear more about this concerning native binary code, when you at least know which language it was written in originally (and thus, which language to try to decompile into). For a long time, I figured it was just because native machine language is so much crazier and more complex than typical bytecode.

But what does bytecode look like? It looks like this:

1000: 2A 40 F0 14
1001: 2A 50 F1 27
1002: 4F 00 F0 F1
1003: C9 00 00 F2

And what does native machine code look like (in hex)? It, of course, looks like this:

1000: 2A 40 F0 14
1001: 2A 50 F1 27
1002: 4F 00 F0 F1
1003: C9 00 00 F2

And the instructions come from a somewhat similar frame of mind:

1000: mov EAX, 20
1001: mov EBX, loc1
1002: mul EAX, EBX
1003: push ECX

So, given that you know which language to try to decompile some native binary into, say C++, what's so hard about it? The only two ideas that immediately come to mind are 1) it really is that much more intricate than bytecode, or 2) something about the fact that operating systems tend to page programs and scatter their pieces causes too many problems. If one of those possibilities is correct, please explain. But either way, why do you basically never hear of this?

NOTE

I'm about to accept one of the answers, but I want to kind of mention something first. Almost everybody is referring back to the fact that different pieces of original source code might map to the same machine code; local variable names are lost, you don't know what type of loop was originally used, etc.

However, examples like the two just mentioned seem kind of trivial to me. Some of the answers, though, state that the difference between machine code and the original source is drastically more than something that trivial.

But for example, when it comes down to things like local variable names and loop types, bytecode loses this information as well (at least for ActionScript 3.0). I've pulled that stuff back through a decompiler before, and I didn't really care whether a variable was called strMyLocalString:String or loc1. I could still look at that small, local scope and see how it's being used without much trouble. And a for loop is pretty much the same exact thing as a while loop, if you think about it. Also, even when I ran the source through irrFuscator (which, unlike secureSWF, doesn't do much more than randomize member variable and function names), it still looked like you could just start isolating certain variables and functions in smaller classes, figure out how they're used, assign your own names to them, and work from there.

In order for this to be a big deal, the machine code would need to lose a lot more information than that, and some of the answers do go into this.


Solution

At every step of compilation you lose information that is irrecoverable. The more information you lose from the original source, the harder it is to decompile.

You can create a useful decompiler for bytecode because a lot more information from the original source is preserved in it than in the final target machine code.

The first step of a compiler is to turn the source into some form of intermediate representation, often a tree. Traditionally this tree does not contain non-semantic information such as comments, whitespace, etc. Once that is thrown away, you cannot recover the original source from the tree.
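To illustrate (the function below is made up, not taken from the question): both of these parse to exactly the same tree, so nothing downstream of the parser can tell which one the author actually typed.

// Version 1: commented and formatted
int area(int w, int h)
{
    // multiply width by height
    return w * h;
}

// Version 2: the same function with comments and whitespace stripped.
// The parser builds an identical tree for it:
//
//   int area(int w,int h){return w*h;}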

The next step is to render the tree into some form of intermediate language that makes optimizations easier. There are quite a few choices here, and each compiler infrastructure has its own. Typically, however, information like local variable names and large control-flow structures (such as whether you used a for or a while loop) is lost. Some important optimizations typically happen here: constant propagation, invariant code motion, function inlining, etc. Each of these transforms the representation into one that has equivalent functionality but looks substantially different.
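As a rough sketch of what that can look like (the functions and names here are invented for illustration, and real compilers do this on their IR rather than on C++ source, but the effect is the same), constant propagation and inlining turn the first pair of functions into something shaped like the last one:

// Before optimization
static int scale(int x)
{
    int factor = 4;            // a local "variable" that is really a constant
    return x * factor;
}

int work(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += scale(i);     // small function, a likely inlining candidate
    return total;
}

// Roughly what the optimized version corresponds to: 'factor' has been
// propagated, 'scale' no longer exists as a separate function, and the
// loop body is simply total += i * 4;
int work_after_optimization(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i * 4;
    return total;
}

The behaviour is equivalent, but a decompiler looking at the result has no evidence that scale or factor ever existed.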

A step after that is to generate the actual machine instructions, which might involve what are called "peephole" optimizations that produce optimized versions of common instruction patterns.
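For instance (illustrative x86-style pseudo-assembly in the same spirit as the listings above; the exact rewrites depend on the compiler and target):

// Before peephole optimization
mov eax, 0
push eax
pop eax

// After: the zero load is replaced by a shorter equivalent and the
// redundant push/pop pair is removed entirely
xor eax, eax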

At each step you lose more and more information until, at the end, you have lost so much that it becomes impossible to recover anything resembling the original code.

Bytecode, on the other hand, typically saves the interesting and transformative optimizations until the JIT phase (the just-in-time compiler), when the target machine code is produced. Bytecode contains a lot of metadata, such as local variable types and class structure, to allow the same bytecode to be compiled to multiple target machine codes. None of this information is necessary in a C++ program, and it is discarded in the compilation process.

There are decompilers for various target machine codes but they often do not produce useful results (something you can modify and then recompile) as too much of the original source is lost. If you have debug information for the executable you can do an even better job; but, if you have debug information, you probably have the original source too.

Other tips

The loss of information pointed out by the other answers is one point, but it is not the dealbreaker. After all, you don't expect the original program back; you just want any representation in a high-level language. If code is inlined, you can just let it be, or automatically factor out common computations. In principle you can undo many optimizations. But there are some operations that are in principle irreversible (without an infinite amount of computation, at least).

For example, branches might become computed jumps. Code like this:

switch (x) {
case 1:
    // foo
    break;
case 2:
    // bar
    break;
}

might get compiled into (sorry that this is not real assembler):

0x1000:   jump to 0x1000 + 4*x
0x1004:   // foo
0x1008:   // bar
0x100c:   // qux

Now, if you know that x can be 1 or 2, you can look at the jumps and reverse this easily. But what about address 0x100c? Should you create a case 3 for it, too? You'd have to trace the whole program in the worst case to figure out what values are allowed. Even worse, you might have to consider all possible user inputs! The core of the problem is that you can't tell data and instructions apart.
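To make that concrete, here is another sketch in the same made-up assembler (addresses and layout invented): the table that drives a computed jump is plain data sitting in the middle of the code, and nothing in the bytes themselves marks it as such.

0x1000:   jump to the address stored at [0x1004 + 4*x]
0x1004:   0x00001010      // data: target for one case
0x1008:   0x00001020      // data: target for another case
0x1010:   // code for the first case
0x1020:   // code for the second case

A disassembler that decodes the words at 0x1004 and 0x1008 as instructions produces nonsense, while one that skips them has to somehow prove how many table entries there are.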

That being said, I wouldn't be entirely pessimistic. As you might have noticed in the above 'assembler', if x comes from outside and is not guaranteed to be 1 or 2, you essentially have a bad bug that lets you jump anywhere. But if your program is free of this kind of bug, it is much easier to reason about. (It is no accident that "safe" intermediate languages like CLR IL or Java bytecode are much easier to decompile, even setting metadata aside.) So in practice, it should be possible to decompile certain well-behaved programs. I'm thinking of individual, functional-style routines that have no side effects and well-defined inputs. I think there are a couple of decompilers around that can give pseudocode for simple functions, but I don't have much experience with such tools.

The reason machine code cannot easily be converted back to the original source code is that a lot of information is lost during compilation. Methods and non-exported classes can be inlined, local variable names are lost, file names and file structures are lost entirely, and compilers can make non-obvious optimizations. Another reason is that multiple different source files could produce the exact same assembly.

For example:

int Add(int x, int y);   // forward declaration so DoSomething can call Add

int DoSomething()
{
    return Add(5, 2);
}

int Add(int x, int y)
{
    return x + y;
}

int main()
{
    return DoSomething();
}

May be compiled to:

main:
mov eax, 7;
ret;

My assembly is pretty rusty, but if the compiler can verify that an optimization can be done safely, it will do so. The compiled binary does not need to know the names DoSomething and Add, nor that Add has two named parameters; the compiler can also see that DoSomething essentially returns a constant, so it can inline both the method call and the method itself.

The compiler's purpose is to produce working machine code, not to bundle up your source files.

The general principles here are many-to-one mappings and lack of canonical representatives.

For a simple example of the many-to-one phenomenon, think about what happens when you take a function with some local variables and compile it to machine code. All the information about the variables is lost because they just become memory addresses. Something similar happens with loops: take a for loop and a while loop, and if they are structured just right, you might get identical machine code built out of jump instructions.

This also brings up the lack of canonical representatives of the original source code for the machine code instructions. When you try to decompile loops, how do you map the jump instructions back to looping constructs? Do you make them for loops or while loops?
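As a sketch (the function names are invented for illustration), a typical optimizing compiler lowers both of the following to essentially the same compare-increment-jump pattern:

int sum_for(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

int sum_while(int n)
{
    int total = 0;
    int i = 0;
    while (i < n) {
        total += i;
        i++;
    }
    return total;
}

Faced with the resulting jumps, a decompiler can only pick one of the two constructs arbitrarily; there is no canonical answer.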

The issue is further exacerbated by the fact that modern compilers perform various forms of folding and inlining, so by the time you get to the machine code it's pretty much impossible to tell which high-level constructs the low-level machine code came from.

License: CC-BY-SA with attribution
Not affiliated with softwareengineering.stackexchange