Output language/format for toy compiler

https://stackoverflow.com/questions/10473824

06-06-2021
|

Question

I took a compilers course in university, and it was very informative and a lot of fun, although also a lot of work. Since we were given a language specification to implement, one thing I didn't learn much about was language design. I'm now thinking of creating a simple toy language for fun, so that I can play around and experiment with different language design principles.

One thing I haven't decided as of yet is what language or format I'd like my compiler to output. Ideally, I'd like to output bytecode for a virtual machine which is easy to use and also has some facilities for debugging (e.g. being able to pause execution and look at the stack at any point.) I've not found one that struck my fancy yet, though. To give you an idea of what I'm looking for, here are some of the options I've considered, along with their pros and cons as I see them:

I could output textual x86 assembly language and then invoke an assembler like NASM or FASM. This would give me some experience compiling for actual hardware, as my previous compiler work was done on a VM. I could probably debug the generated programs using gdb, although it might not be as easy as using a VM with debugging support. The major downside to this is that I have limited experience with x86 assembly, and as a CISC instruction set it's a bit daunting.
I could output bytecode for a popular virtual machine like the JVM or Lua virtual machine. The pros and cons of these are likely to vary according to which specific VM I choose, but in general the downside I see here is potentially having to learn a bytecode which might have limited applicability to my future projects. I'm also not sure which VM would be best suited to my needs.
I could use the same VM used in my compilers course, which was designed at my university specifically for this purpose. I am already familiar with its design and instruction set, and it has decent debugging features, so that's a huge plus. However, it is extremely limited in its capabilities and I feel like I would quickly run up against those limits if I tried to do anything even moderately advanced.
I could use LLVM and output LLVM Intermediate Representation. LLVM IR seems very powerful and being familiar with it could definitely be of use to me in the future. On the other hand, I really have no idea how easy it is to work with and debug, so I'd greatly appreciate advice from someone experienced in that area.
I could design and implement my own virtual machine. This has a huge and obvious downside: I'd essentially be turning my project into two projects, significantly decreasing the likelihood that I'd actually get anything done. However, it's still somewhat appealing in that it would allow me to make a VM which had "first-class" support for the language features I want—for instance, the Lua VM has first-class support for tables, which makes it easy to work with them in Lua bytecode.

So, to summarize, I'm looking for a VM or assembler I can target which is relatively easy to learn and work with, and easy to debug. As this is a hobby project, ideally I'd also like to minimize the chance that I spend a great deal of time learning some tool or language that I'll never use again. The main thing I hope to gain from this exercise is some first-hand understanding of the complexities of language design, though, so anything that facilitates a relatively quick implementation will be great.

Solution

It really depends on how complete a language you want to build, and what you want to do with it. If you want to create a full-blown language for real projects that interacts with other languages, your needs are going to be much greater than if you just want to experiment with the complexities of compiling particular language features.

Output to an assembly language file is a popular choice. You can annotate the assembly language file with the actual code from your program (in comments). That way, you can see exactly what your compiler did for each language construct. It might be possible (it's been a long time since I worked with these tools) to annotate the ASM file in a way that makes source-level debugging possible.

If you're going to be working in language design, then you'll almost certainly need to know x86 assembly language. So the time you spend learning it won't be wasted. And the CISC instruction set really isn't a problem. It'll take you a few hours of study to understand the registers and the different addressing modes, and probably less than a week to be somewhat proficient, provided you've already worked with some other assembly language (which it appears you have).

Outputting byte code for JVM, lua, or .NET is another reasonable approach, although if you do that you tie yourself to the assumptions made by the VM. And, as you say, it's going to require detailed knowledge of the VM. It's likely that any of the popular VMs would have the features you need, so selection is really a matter of preference rather than capabilities.

LLVM is a good choice. It's powerful and becoming increasingly popular. If you output LLVM IR, you're much more likely to be able to interact with others' code, and have theirs interact with yours. Knowing the workings of LLVM is a definite plus if you're looking to get a job in the field of compilers or language design.

I would not recommend designing and implementing your own virtual machine before you get a little more experience with other VMs so that you can see and understand the tradeoffs that they made in implementation. If you go down this path, you'll end up studying JVM, lua, .NET, and many other VMs. I'm not saying not to do it, but rather that doing so will take you away from your stated purpose of exploring language design.

Knowledge is rarely useless. Whatever you decide to use will require you to learn new things. And that's all to the good. But if you want to concentrate on language design, select the output format that requires the least amount of work that's not specifically language design. Consistent, of course, with capabilities.

Of your options, it looks to me like your university's VM is out. I would say that designing your own VM is out, too. Of the other three, I'd probably go with LLVM. But then, I'm very familiar with x86 assembly so the idea of learning LLVM is somewhat appealing.

OTHER TIPS

Have a look at my Programming Languages ZOO. It has a number of toy implementations of languages, including some virtual machines and made-up assembly (a stack machine). It should help you get started.

If your just playing around with language design what about an interpreted language? Having the whole AST still around at run time lets you do some very cool things.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow