Are compilers and language specification/grammars the same?

https://stackoverflow.com/questions/4473305

11-10-2019
|

Question

Can a compiler be written, from which you can not reverse engineer the grammar and meaning of the input language.

i.e. can you always get the specification of the language from the compiler?

Let's say I want to compile from ?? to some language but I do not want people who read the compiler to be able to read & understand ??

I personally have a feeling that compilers and language specifications are isomorphic but I'm interested from an academic point of view whether this is wrong.

Solution

Assuming you're saying they only have access to the binary:

Short answer: Not if a person cares enough.

Long answer: It is always possible, if a person was so inclined and had lots of free time, to rip the compiler down to the byte level and map it completely. From there, you could figure out the logic trees, and reconstruct the language.

It would be painful, but this falls under the same category as "can i ever make an algorithm that prevents a dedicated user from cracking the cd-key verification".

Now, if you never actually gave the compiler to a person (imagine some sort of proxy system?) it might be reasonable to say that a user would have to take a very, very long time to brute force the language specifications, if he could ever generate something that could exercise it completely.

If you're implying that they have access to the source code:

No. You can obfuscate it, but the compiler still has to construct the same logical trees, no matter how difficult to read.

There might be some esoteric way to do this..if you provided the language tree separately in some sort of encrypted binary form...and didn't supply the compiler's source..and your users weren't bored NSA types.

OTHER TIPS

I think the compiler always reveals the specification of the language that it compiles (I'm aware that this is super hand-wavy).

However, there is probably no algorithm to do so (i.e. it is undecidable), because, for e.g. that algorithm would need to find out which programs the compiler will halt on.

No, they are not the same. But a compiler inevitably understands the input language's grammar, and (hopefully) follows the language specification very precisely. Therefore, understanding the compiler means understanding those.

Of course it's possible to obfuscate the compiler source code so heavily that no one will bother to read it and extract the grammar and the language's rules. Of course that hurts developers too (good luck maintaining that crap!).

Also, reading the compiler's source would be my last option if I wanted to know something about the language (not about how it's implemented but how it's defined on an more abstract level) - I'd head to the spec or some other authorative source (official docs etc), as it would be way easier even if the compiler's code is very understandable.

My gut feeling is that you could determine the semantic behavior of the compiler by inspecting it's output. But you couldn't obtain the actual syntax without documentation or access to the compiler source. if you have the source this becomes trivial, so I'm assuming you don't have the source of the compiler, just access to it as a tool.

In general, if there is an information about semantics in the code (and there's always an operational semantics defined in any interpreter or compiler), then it's always possible to extract that information. The only question is the complexity of such a reverse engineering. So, you need an obfuscated language and an obfuscated compiler.

Take a look at the Malbolge "decompiler" for example.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow