Can you decompile a game down to its original source code?

Question 1

No, because the mapping from instructions to code is not 1:1.

No. The compiler mangles the structure of your program; there is no other word for it. Instruction scheduling and the effort to reduce register pressure at certain points can move instructions that belong to the same operation up to 150,000 instructions away from each other (IIRC this is the stock cap in GCC; you can change it with a -f option, of course :P).

No, no, no.

The only promise the compilation process offers is that the result will work as if it actually did what the programmers wrote. That's it.

Looking at Stuxnet was interesting (yes, not a game, I know) and practical because it was small. In a game, the parts of the program driving the scene graph alone will be huge and heavily optimised. I'd also be shocked if they didn't use link-time optimisation, which removes even more of the structure.

This answer lacks a lot of detail, but that's because one explaining everything would be huge. You obviously have no idea how this works, and it's good you want to learn.

http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf

I've linked this many a time; it has some examples of code mapping to register instructions. Those examples aren't optimised, and they are small samples for a much simpler (sort of, depends how you look at it) machine. Can you see how difficult even reversing these would be?

Lastly, debugging with -O3 is a joke. We have -Og now, where the compiler optimises but avoids structure-changing optimisations, so debugging doesn't jump around so much. When you use -g, the resulting object files carry debug information that maps the generated instructions back to the source lines they came from, so a tool such as objdump -S can show the source interleaved above the instructions it produced. Fun facts!

Question 2

You can't recover original source code - the process of compilation is inherently lossy and some detail will inevitably be lost. How much is lost will depend on the source language, target language and choices made by developers.
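To make "lossy" concrete, here is a minimal sketch using CPython's own compiler as a stand-in (the variable names and the constant-folding behaviour are assumptions about a typical CPython build, not something from the question). Two different source lines are compiled and their instruction streams compared; where the compiler folds the arithmetic, nothing in the output says which version the programmer wrote.

```python
import dis


def instructions(source):
    """Compile source and return its instructions as (opname, argument) pairs."""
    code = compile(source, "<example>", "exec")
    return [(ins.opname, ins.argval) for ins in dis.get_instructions(code)]


# Two ways a programmer might have written the same thing (hypothetical names).
written_as_expression = "seconds_per_day = 60 * 60 * 24\n"
written_as_literal = "seconds_per_day = 86400\n"

folded = instructions(written_as_expression)
literal = instructions(written_as_literal)

for opname, argval in folded:
    print(opname, argval)

# If the compiler folded the constants, the multiplication is simply gone.
same = folded == literal
print("identical instruction streams:", same)
```

A decompiler looking at that output can only guess at one of the possible sources, and this is about the mildest optimisation a compiler does.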

Let's start with the easy cases - a high-level language compiled to its own bytecode. For example, Python to .pyc, C# to .NET IL (.dll), Java to .class/.dex. In each of these examples, the bytecode contains direct representations of high-level concepts in the language such as classes, methods, virtual function calls, class layouts, etc. Decompilers exist that will restore shockingly accurate source code from the compiled code.

Here's a brief example in Python. Original source:

class MyClass:
    def function(self, a, b):
        print("Hello, world:", a, b)

MyClass().function("test", 1234.5678)

Compiled with Python 3.6, and decompiled again using uncompyle6:

# uncompyle6 version 3.3.5
# Python bytecode 3.6 (3379)
# Decompiled from: Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) 
# [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
# Embedded file name: /private/tmp/test.py
# Compiled at: 2019-12-23 16:34:01
# Size of source mod 2**32: 121 bytes


class MyClass:

    def function(self, a, b):
        print('Hello, world:', a, b)


MyClass().function('test', 1234.5678)
# okay decompiling __pycache__/test.cpython-36.pyc

Aside from some extra comments and spaces, the output is basically 1:1 with the original. Java and C# are similarly easy to decompile. Many games are written in Java (e.g. Android) and C# (e.g. Unity), and there are a lot of modders/hackers using decompilers to obtain usable source code for games written in these languages.
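If you want to see why the round trip works so well, a rough sketch like the following pokes at the compiled code object directly. The helper `find_code` is a hypothetical name made up for this illustration; `compile`, `dis.dis` and the `co_*` attributes are standard CPython.

```python
import dis

source = '''
class MyClass:
    def function(self, a, b):
        print("Hello, world:", a, b)
'''

module_code = compile(source, "test.py", "exec")


def find_code(code, name):
    """Recursively search nested code objects for one with the given name."""
    for const in code.co_consts:
        if hasattr(const, "co_name"):
            if const.co_name == name:
                return const
            found = find_code(const, name)
            if found is not None:
                return found
    return None


func_code = find_code(module_code, "function")

print(func_code.co_varnames)  # parameter and local variable names
print(func_code.co_names)     # global names the function refers to
print(func_code.co_consts)    # literal constants used by the function
dis.dis(func_code)            # the bytecode itself
```

The parameter names, the name `print` and the string literal are all still sitting in the compiled code, which is most of what a decompiler needs to rebuild the function.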

A developer can choose to defend against a decompiler by using obfuscation, where they deliberately mangle the compiled output in some way (e.g. renaming variables/functions/classes to gibberish names) to make this type of reverse engineering harder.
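As a toy illustration of the renaming idea (not how any particular commercial obfuscator works), the sketch below rewrites parameter and local variable names to meaningless ones using Python's `ast` module. The class `RenameLocals` and the sample function `damage` are made up for this example, and `ast.unparse` assumes Python 3.9 or later.

```python
import ast


class RenameLocals(ast.NodeTransformer):
    """Rename parameters and assigned variables to meaningless names."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name):
        # Hand out short gibberish names in order of first appearance.
        return self.mapping.setdefault(name, f"_{len(self.mapping)}")

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            node.id = self._alias(node.id)
        elif node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node


original = '''
def damage(base, multiplier):
    total = base * multiplier
    return total
'''

tree = RenameLocals().visit(ast.parse(original))
obfuscated = ast.unparse(tree)
print(obfuscated)

# The behaviour is unchanged; only the human-readable names are gone.
namespace = {}
exec(obfuscated, namespace)
result = namespace["damage"](3, 4)
print(result)
```

This toy version would break callers that pass keyword arguments, which is one reason real obfuscators are considerably more careful. The point is only that a decompiler can recover the structure but has no way to get the meaningful names back.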

The harder case is when you take code and compile it all the way down to machine code (code that runs directly on the CPU). Languages like Rust, Go, C++ and Swift all compile straight to machine code by default. CPU instructions don't correspond 1-to-1 to concepts in the high-level language. Now, there are decompilers - the NSA's recently open-sourced Ghidra decompiler is one of the best out there - but they can only give you a very crude approximation of the original source, and most only decompile to C (not all the way to Rust/Go/C++/Swift/etc.). Here's a simple C++ program:

#include <iostream>

class MyClass {
public:
  void function(const char *a, const double b) {
    std::cout << "Hello, world: " << a << " " << b << std::endl;
  }
};

int main() {
  MyClass m;
  m.function("test", 1234.5678);
}

Here's how Ghidra 9.1 decompiles it:


// MyClass::function(char const*, double)

void __thiscall MyClass::function(MyClass *this,char *param_1,double param_2)

{
  char cVar1;
  basic_ostream *pbVar2;
  size_t sVar3;
  long *plVar4;
  long *plVar5;
  undefined local_20 [8];
  
  pbVar2 = std::__1::__put_character_sequence<char,std--__1--char_traits<char>>
                     ((basic_ostream *)__ZNSt3__14coutE,"Hello, world: ",0xe);
  sVar3 = __stubs::_strlen(param_1);
  pbVar2 = std::__1::__put_character_sequence<char,std--__1--char_traits<char>>
                     (pbVar2,param_1,sVar3);
  pbVar2 = std::__1::__put_character_sequence<char,std--__1--char_traits<char>>(pbVar2," ",1);
  plVar4 = (long *)__stubs::__ZNSt3__113basic_ostreamIcNS_11char_traitsIcEEElsEd(param_2,pbVar2);
  __stubs::__ZNKSt3__18ios_base6getlocEv(local_20,*(long *)(*plVar4 + -0x18) + (long)plVar4);
  plVar5 = (long *)__stubs::__ZNKSt3__16locale9use_facetERNS0_2idE(local_20,__ZNSt3__15ctypeIcE2idE)
  ;
  cVar1 = (**(code **)(*plVar5 + 0x38))(plVar5,10);
  __stubs::__ZNSt3__16localeD1Ev(local_20);
  __stubs::__ZNSt3__113basic_ostreamIcNS_11char_traitsIcEEE3putEc(plVar4,(ulong)(uint)(int)cVar1);
  __stubs::__ZNSt3__113basic_ostreamIcNS_11char_traitsIcEEE5flushEv(plVar4);
  return;
}


undefined8 entry(void)

{
  MyClass local_10 [8];
  
  MyClass::function(local_10,"test",1234.56780000);
  return 0;
}

An experienced reverse engineer can make sense of this - but it's a lot less nice.

So there you have it. If you're reverse engineering a program compiled to native CPU code, you can get source but it's going to be pretty rough. If you're reverse engineering a program compiled to some intermediate bytecode, you'll have a better time. In all cases, you can't get exactly the original source code, but you might be able to get pretty close.

Question 3

The other answers aren't accurate.

There are several reverse engineering projects out there which reconstruct C code that compiles to the exact same bytes, given the original compiler. Please see https://github.com/pret/pokeemerald . Of course you lose names and comments, but it is not accurate to answer this question with a flat no. It's perfectly possible to construct recompilable, matching C code (in this narrow case, anyway); it's just really tedious, and a question of permuting through candidate C code fast enough to find a version that matches.

The actual answer? Yes. Will you be able to reasonably find 1:1 matching members for every function? Probably not.
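The matching workflow can be sketched in a few lines. To keep it runnable without the original toolchain, the sketch uses CPython's `compile` as a stand-in for "the original compiler"; the real projects do this with C and the game's actual compiler. The helper `function_bytes` and the candidate functions are hypothetical, made up for this illustration.

```python
def function_bytes(source):
    """Compile source and return the raw bytecode of the one function it defines."""
    module = compile(source, "<candidate>", "exec")
    func = next(c for c in module.co_consts if hasattr(c, "co_code"))
    return func.co_code


# Pretend this is all we have: the compiled bytes of some original function.
target = function_bytes("def damage(base, bonus):\n    return base * 2 + bonus\n")

# Hand-written guesses at what the source might have looked like.
candidates = [
    "def f(a, b):\n    return a + b * 2\n",
    "def f(a, b):\n    return b + a * 2\n",
    "def f(a, b):\n    return a * 2 + b\n",
]

# Recompile every guess and keep the ones that reproduce the target bytes.
matches = [src for src in candidates if function_bytes(src) == target]
for src in matches:
    print(src)
```

The matching candidate reproduces the bytes exactly even though every name in it differs from the original, which is the sense in which names and comments are lost while the code still matches.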

Question 4

Not always easy. How hard is it to take two prime numbers and multiply them together? Easy. OK, how hard is it to take a big number and determine its prime factors? Very difficult if the number is big enough.
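A rough sketch of that asymmetry, using naive trial division (real factoring algorithms are much smarter, but still far more expensive than multiplying). The function name `smallest_factor` and the two primes are just picked for the example.

```python
def smallest_factor(n):
    """Find the smallest prime factor of n by trial division, counting the divisions tried."""
    divisions = 0
    d = 2
    while d * d <= n:
        divisions += 1
        if n % d == 0:
            return d, divisions
        d += 1
    return n, divisions


p, q = 10007, 10009

# Going forward is a single multiplication.
n = p * q

# Going backward means searching for a divisor.
factor, divisions = smallest_factor(n)
print(n, "=", factor, "*", n // factor, "after", divisions, "trial divisions")
```

With primes this small the search finishes instantly, but the work grows with the size of the factors while the multiplication stays trivial.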

The same is true for decompiling code. Trying to figure out what C or C++ code generated some assembly code you've got is very difficult for all but the smallest and easiest cases. In some cases the decompiler fails and can't generate C code, and you're stuck trying to figure out what some massive block of assembly means.

Worse, some critical parts might never have been written in C or C++ in the first place. The developer might have written assembly code by hand that can't be translated to a higher-level language because it does things that have no equivalent in that language.

Worse still, some developers put their code through an obfuscation program afterward, and now the decompiler's already hard job just got vastly harder.