Question

According to Wikipedia, the term "bootstrapping" in the context of writing compilers means this:

In computer science, bootstrapping is the process of writing a compiler (or assembler) in the source programming language that it intends to compile. Applying this technique leads to a self-hosting compiler.

And I can understand how that would work. However, the story seems to be a little different for interpreters. Now, of course, it is possible to write a self-hosting interpreter. That's not what I'm asking. What I'm actually asking is: Is it possible to make a self-hosted interpreter independent of the original, first interpreter? To explain what I mean, consider this example:

You write your first interpreter version in language X, and the interpreter is for a new language you're creating, called Y. You first use language X's compiler to create an executable. You now can interpret files written in your new language Y using the interpreter written in language X.

Now, as far as I understand, to be able to "bootstrap" the interpreter you wrote in language X, you'd need to rewrite the interpreter in language Y. But here is the catch: even if you do rewrite the entire interpreter in language Y, you're still going to need the original interpreter you wrote in language X. Because to run the interpreter in language Y, you're going to have to interpret the source files. But what exactly is going to interpret the source files? Well, it can't be nothing, of course, so you're forced to still use the first interpreter.

No matter how many new interpreters you write in language Y, you're always going to have to use the first interpreter written in X to interpret the subsequent interpreters. This seems to be a problem simply because of the nature of interpreters.

However, on the flip side, this Wikipedia article on interpreters actually talks about self-hosting interpreters. Here is a small excerpt which is relevant:

A self-interpreter is a programming language interpreter written in a programming language which can interpret itself; an example is a BASIC interpreter written in BASIC. Self-interpreters are related to self-hosting compilers.

If no compiler exists for the language to be interpreted, creating a self-interpreter requires the implementation of the language in a host language (which may be another programming language or assembler). By having a first interpreter such as this, the system is bootstrapped and new versions of the interpreter can be developed in the language itself.

It's still not clear to me though, how exactly this would be done. It seems that no matter what, you're always going to be forced to use the first version of your interpreter written in the host language.

Now the article mentioned above links to another article in which Wikipedia gives some examples of supposed self-hosting interpreters. Upon closer inspection though, it seems that the main "interpreting" part of many of those self-hosting interpreters (especially some of the more common ones such as PyPy or Rubinius) is actually written in other languages such as C++ or C.

So is what I describe above possible? Can a self-hosted interpreter be independent of its original host? If so, how exactly would this be done?


Solution

The short answer is: you are right in your suspicion. You always need either another interpreter written in X or a compiler from Y to some other language for which you already have an interpreter. Interpreters execute, compilers only translate from one language to another; at some point in your system there must be an interpreter … even if it's just the CPU.
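To make that distinction concrete, here is a minimal sketch, invented purely for illustration (it does not correspond to any real implementation): a toy prefix-notation language with both an interpreter and a compiler to Python. The interpreter produces the program's value directly; the compiler only produces Python source text, which still needs something else to execute it.

```python
# A minimal sketch, invented for illustration (no real system is implied):
# a toy prefix-notation language with an interpreter and a compiler to Python.
import math

def parse(src):
    """Turn '(+ 1 (* 2 3))' into nested lists: ['+', 1, ['*', 2, 3]]."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1
        tok = tokens[pos]
        return (int(tok) if tok.lstrip("-").isdigit() else tok), pos + 1

    return read(0)[0]

def interpret(node):
    """An *interpreter*: it executes the program and yields its value."""
    if isinstance(node, int):
        return node
    args = [interpret(child) for child in node[1:]]
    return {"+": sum, "*": math.prod}[node[0]](args)

def compile_to_python(node):
    """A *compiler*: it only translates the program into another language
    (here Python source); something else still has to execute the result."""
    if isinstance(node, int):
        return str(node)
    joiner = {"+": " + ", "*": " * "}[node[0]]
    return "(" + joiner.join(compile_to_python(child) for child in node[1:]) + ")"

program = parse("(+ 1 (* 2 3))")
print(interpret(program))                # 7  -- executed directly
print(compile_to_python(program))        # (1 + (2 * 3)) -- just text in another language
print(eval(compile_to_python(program)))  # 7  -- the Python interpreter (ultimately the CPU) runs it
```

The last two print lines are the crux: the compiler's output is inert text until some interpreter, here CPython and ultimately the CPU, executes it.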

No matter how many new interpreters you write in language Y, you're always going to have to use the first interpreter written in X to interpret the subsequent interpreters. This seems to be a problem simply because of the nature of interpreters.

Correct. What you can do is write a compiler from Y to X (or to another language for which you have an interpreter), and you can even do that in Y. Then you can run your Y compiler written in Y on the Y interpreter written in X (or on the Y interpreter written in Y running on the Y interpreter written in X, or on the Y interpreter written in Y running on the Y interpreter written in Y running on the Y interpreter written in X, or … ad infinitum) to compile your Y interpreter written in Y to X, so that you can then execute it on an X interpreter. That way, you have gotten rid of your Y interpreter written in X, but now you need the X interpreter (which we know we already have, since otherwise we couldn't have run the Y interpreter written in X in the first place), and you had to write a Y-to-X compiler first.
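The dataflow of that chain may be easier to follow as a sketch. Everything below is a hypothetical stand-in invented for this answer: the "sources" are plain strings and interpret_y_in_x is a stub, so only the shape of the bootstrap step is meaningful.

```python
# Hypothetical stand-ins for the pieces named above (not real code for Y or X).
Y_INTERPRETER_IN_Y = "source text of the Y interpreter, written in Y"
Y_TO_X_COMPILER_IN_Y = "source text of the Y-to-X compiler, written in Y"

def interpret_y_in_x(y_program_src, program_input):
    """Stub for the Y interpreter written in X: runs a Y program on an input.
    The only Y program modelled here is the Y-to-X compiler, which we pretend
    translates its input by tagging it as X source."""
    assert y_program_src == Y_TO_X_COMPILER_IN_Y
    return f"X translation of: {program_input}"

# Bootstrap step: run the Y-to-X compiler (a Y program) on the Y interpreter
# written in X, feeding it the Y interpreter written in Y as input.
x_source_of_y_interpreter = interpret_y_in_x(Y_TO_X_COMPILER_IN_Y,
                                             Y_INTERPRETER_IN_Y)

# The result is an X program, so the existing X toolchain can run it from now
# on and the Y interpreter written in X can be retired -- but an X interpreter
# (ultimately the CPU) is still doing the executing.
print(x_source_of_y_interpreter)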

However, on the flip side, this Wikipedia article on interpreters actually talks about self-hosting interpreters. Here is a small excerpt which is relevant:

A self-interpreter is a programming language interpreter written in a programming language which can interpret itself; an example is a BASIC interpreter written in BASIC. Self-interpreters are related to self-hosting compilers.

If no compiler exists for the language to be interpreted, creating a self-interpreter requires the implementation of the language in a host language (which may be another programming language or assembler). By having a first interpreter such as this, the system is bootstrapped and new versions of the interpreter can be developed in the language itself.

It's still not clear to me though, how exactly this would be done. It seems that no matter what, you're always going to be forced to use the first version of your interpreter written in the host language.

Correct. Note that the Wikipedia article explicitly says that you need a second implementation of your language, and it doesn't say that you can get rid of the first.

Now the article mentioned above links to another article in which Wikipedia gives some examples of supposed self-hosting interpreters. Upon closer inspection though, it seems that the main "interpreting" part of many of those self-hosting interpreters (especially some of the more common ones such as PyPy or Rubinius) is actually written in other languages such as C++ or C.

Again, correct. Those are really bad examples. Take Rubinius, for example. Yes, it's true that the Ruby part of Rubinius is self-hosted, but it is a compiler, not an interpreter: it compiles Ruby source code to Rubinius bytecode. The interpreter part, OTOH, isn't self-hosted: it interprets Rubinius bytecode, but it is written in C++. So, calling Rubinius a "self-hosted interpreter" is wrong: the self-hosted part isn't an interpreter, and the interpreter part isn't self-hosted.

PyPy is similar, but even more incorrect: it isn't even written in Python in the first place, it is written in RPython, which is a different language. RPython is syntactically similar to Python and semantically an "extended subset" of it, but it is actually a statically-typed language roughly on the same abstraction level as Java, and its implementation is a compiler with multiple backends which compiles RPython to C source code, ECMAScript source code, CIL bytecode, JVM bytecode, or Python source code.

So is what I describe above possible? Can a self-hosted interpreter be independent of its original host? If so, how exactly would this be done?

No, not on its own. You would either need to keep the original interpreter or write a compiler and compile your self-interpreter.

There are some meta-circular VMs, such as Klein (written in Self) and Maxine (written in Java). Note, however, that here the definition of "meta-circular" is yet again different: these VMs are not written in the language they execute: Klein executes Self bytecode but is written in Self, Maxine executes JVM bytecode but is written in Java. However, the Self / Java source code of the VM actually gets compiled to Self / JVM bytecode and then executed by the VM, so by the time the VM gets executed, it is in the language it executes. Phew.

Note also that this is different from VMs such as the SqueakVM and the Jikes RVM. Jikes is written in Java, and the SqueakVM is written in Slang (a statically typed syntactic and semantic subset of Smalltalk roughly on the same abstraction level as a high-level assembler), and both get statically compiled to native code before they are run. They don't run inside of themselves. You can, however, run them on top of themselves (or on top of another Smalltalk VM / JVM). But that is not "meta-circular" in this sense.

Maxine and Klein, OTOH, do run inside of themselves; they execute their own bytecode using their own implementation. This is truly mindbending! It allows some cool optimization opportunities: for example, since the VM executes itself together with the user program, it can inline calls from the user program into the VM and vice versa, e.g. calls to the garbage collector or the memory allocator can be inlined into user code, and reflective callbacks in the user code can be inlined into the VM. Also, modern VMs watch the executing program and optimize it depending on the actual workload and data; such a VM can apply those same tricks to itself while it is executing the user program, which in turn is executing its specific workload. In other words, the VM highly specializes itself for that particular program running that particular workload.

However, did you notice that I skirted around the word "interpreter" above and always used "execute"? Well, those VMs aren't built around interpreters, they are built around (JIT) compilers. An interpreter was added to Maxine later, but you always need the compiler: you have to run the VM once on top of another VM (e.g. Oracle HotSpot in the case of Maxine), so that the VM can (JIT) compile itself. In the case of Maxine, it will JIT compile its own bootup phase, then serialize that compiled native code to a bootstrap VM image and stick a very simple bootloader in front (the only component of the VM written in C, although that's just for convenience, it could also be in Java). Now you can use Maxine to execute itself.

Other tips

You are correct in noting that a self-hosting interpreter still requires an interpreter to run itself, and can't be bootstrapped in the same sense as a compiler.

However, a self-hosted language is not the same thing as a self-hosted interpreter. It is usually easier to build an interpreter than it is to build a compiler. Therefore, to implement a new language, we might first implement an interpreter in an unrelated language. We can then use that interpreter to develop a compiler for our language. The language is then self-hosted, since the compiler is written in the language itself and runs on the interpreter. The compiler can then compile itself, at which point it is fully bootstrapped.

A special case of this is a self-hosting JIT-compiling runtime. It can start with an interpreter written in a host language, which is then used to implement JIT compilation in the new language itself, after which the JIT compiler can compile itself. This feels like a self-hosting interpreter but sidesteps the infinite-interpreters problem. This approach is used in practice, but is not yet very common.

Another related technique is an extensible interpreter, where we can create extensions in the language being interpreted. For example, we might implement new opcodes in the language. This can turn a bare-bones interpreter into a feature-rich interpreter, as long as we avoid circular dependencies.
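As a hedged illustration of that idea (a toy invented here, not modelled on any particular system), consider a minimal stack-machine interpreter whose host-language core knows only PUSH, ADD and MUL, but which lets new opcodes be defined as sequences of existing opcodes written in the mini language itself:

```python
# A toy extensible interpreter, invented here for illustration: the host-
# language core knows only PUSH, ADD and MUL, but new opcodes can be defined
# as sequences of existing opcodes, written in the mini language itself.
CORE = {
    "ADD": lambda stack: stack.append(stack.pop() + stack.pop()),
    "MUL": lambda stack: stack.append(stack.pop() * stack.pop()),
}
EXTENSIONS = {}  # opcode name -> its body, written in the interpreted language

def define(name, body_source):
    """Install a new opcode whose body is itself mini-language code."""
    EXTENSIONS[name] = body_source.split()

def run(source):
    stack, program, i = [], source.split(), 0
    while i < len(program):
        op = program[i]
        if op == "PUSH":
            stack.append(int(program[i + 1]))
            i += 2
        elif op in CORE:
            CORE[op](stack)
            i += 1
        elif op in EXTENSIONS:
            # Expand the user-defined opcode inline -- no host-language code
            # needs to change to support it.
            program[i:i + 1] = EXTENSIONS[op]
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack.pop()

define("DOUBLE", "PUSH 2 MUL")          # new opcode, defined *in* the mini language
print(run("PUSH 21 DOUBLE"))            # 42
print(run("PUSH 3 PUSH 4 ADD DOUBLE"))  # 14
```

The circular-dependency caveat above applies here as well: DOUBLE may only be built from opcodes that already exist when it is expanded.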

A case that actually occurs fairly commonly is the ability of the language to influence its own parsing, e.g. as parse-time evaluated macros. Since the macro language is the same as the language being processed, it tends to be far more feature-rich than dedicated or restricted macro languages. However, it is correct to note that the language performing the extension is a slightly different language from the language after extension.

When “real” self-hosted interpreters are used, this is usually done for reasons of education or research. For example, implementing an interpreter for Scheme in Scheme itself is a cool way to teach programming languages (see SICP).

Licensed under: CC-BY-SA with attribution