How to create binaries for virtual machines? [closed]

Question 1

Alright, I'll bite on this generic question.

Implementing a compiler/assembler/VM combo is a tall order, especially if you're doing it by yourself. That being said, if you keep your language specification simple enough, it is quite doable, even by yourself.

Basically, to create a binary, the following is done (this is a tad simplified):

1) Input source is read and lexed into tokens

2) The program logic is analyzed for semantic correctness.

E.g. while the following C++ consists entirely of valid tokens and would get through the lexer, it would be rejected by the analysis that follows:

float int* double = const (_identifier >><<) operator& *

3) Build an Abstract Syntax Tree to represent the statements

4) Build symbol tables and resolve identifiers

5) Optional: Optimization of code

6) Generate code in an output format of your choice; for example binary opcodes/operands, string tables. Whatever format suits your needs best. Alternatively, you could create bytecode for an existing VM, or for a native CPU.
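To make steps 1 through 4 a bit more concrete, here is a minimal sketch of a front end for a toy arithmetic language. Everything in it is an assumption made for illustration: the token grammar, the nested-tuple AST shape, and the names `tokenize`, `parse`, and `resolve` are hypothetical, and a real language would need far more than this.

```python
import re

# Hypothetical token grammar: integers, identifiers, a few operators, whitespace.
TOKEN_RE = re.compile(
    r"(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[+\-*/()])|(?P<ws>\s+)"
)

def tokenize(source):
    """Step 1: turn source text into a list of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "ws":  # whitespace is skipped, not tokenized
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(tokens):
    """Step 3: build an AST of nested tuples with a recursive-descent parser."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("eof", "")

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():
        kind, text = advance()
        if kind == "num":
            return ("num", int(text))
        if kind == "name":
            return ("var", text)
        if (kind, text) == ("op", "("):
            node = expr()
            if advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")

    def binary(operand, ops):
        # Left-associative chain: operand (op operand)*
        node = operand()
        while peek()[0] == "op" and peek()[1] in ops:
            op = advance()[1]
            node = ("binop", op, node, operand())
        return node

    def term():
        return binary(factor, {"*", "/"})

    def expr():
        return binary(term, {"+", "-"})

    tree = expr()
    if peek()[0] != "eof":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return tree

def resolve(node, symbols):
    """Steps 2 and 4: replace each identifier with its slot from the symbol table."""
    if node[0] == "var":
        if node[1] not in symbols:
            raise NameError(f"undefined identifier {node[1]!r}")
        return ("slot", symbols[node[1]])
    if node[0] == "binop":
        _, op, left, right = node
        return ("binop", op, resolve(left, symbols), resolve(right, symbols))
    return node  # literals need no resolution
```

For example, `resolve(parse(tokenize("x + 2 * (y - 1)")), {"x": 0, "y": 1})` yields a tree in which `x` and `y` have been replaced by slot indices, while an unknown name raises `NameError`.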

EDIT If you want to devise your own bytecode format, you can write, for example:

1) File Header
    DWORD filesize
    DWORD checksum
    BYTE  endianness
    DWORD entrypoint  <-- Entry point for first instruction in main() or whatever

2) String table
    DWORD numstrings
    <strings>
        DWORD stringlen
        <string bytes/words>

3) Instructions
    DWORD numinstructions
    <instructions>
        DWORD opcode
        DWORD numops    <-- or deduce from opcode
        DWORD op1_type  <-- stack index, integer literal, index to string table, etc
        DWORD operand1
        DWORD op2_type
        DWORD operand2
        ...

END
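As a rough illustration, a layout like the one above could be written out with Python's `struct` module. This is only a sketch under several assumptions that the layout itself does not specify: little-endian encoding throughout, an endianness flag of 0 meaning little-endian, a CRC-32 checksum computed over everything after the header, UTF-8 strings, and the hypothetical operand type constants shown below.

```python
import struct
import zlib

# Header fields: filesize, checksum, endianness, entrypoint ("<" means no padding).
HEADER = struct.Struct("<IIBI")

# Hypothetical operand type tags (assumptions, not part of the layout above).
OP_STACK_INDEX, OP_INT_LITERAL, OP_STRING_INDEX = range(3)

def build_image(strings, instructions, entrypoint=0):
    """Serialize a string table and instruction list into one bytes object.

    `instructions` is a list of (opcode, [(op_type, operand), ...]) pairs.
    """
    body = bytearray()

    # 2) String table: count, then length-prefixed strings.
    body += struct.pack("<I", len(strings))
    for text in strings:
        raw = text.encode("utf-8")
        body += struct.pack("<I", len(raw)) + raw

    # 3) Instructions: count, then opcode, operand count, (type, value) pairs.
    body += struct.pack("<I", len(instructions))
    for opcode, operands in instructions:
        body += struct.pack("<II", opcode, len(operands))
        for op_type, value in operands:
            body += struct.pack("<II", op_type, value)

    # 1) Header is built last because it needs the total size and the checksum.
    header = HEADER.pack(HEADER.size + len(body), zlib.crc32(body), 0, entrypoint)
    return header + bytes(body)
```

A loader would do the reverse: read the header with `HEADER.unpack_from`, verify the size and checksum, then walk the two tables.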

Overall, the steps are manageable, but, as always, the devil is in the details.

Some good references are:

The Dragon Book - This is heavy on theory, so it's a dry read, but worthwhile

Game Scripting Mastery - Guides you along while developing all three components in a more practical manner. The example code is rife with security issues, memory leaks, and overall lousy coding style (imho). Still, you can take a lot of concepts away from this book, and it's worth a read.

The Art of Compiler Design - I have not read this one personally, but I have heard positive things about it.

If you decide to go down this road, be sure you know what you're getting yourself into. This is not something for the faint of heart, or for someone new to programming. It requires a lot of conceptual thinking and prior planning. It is, however, quite rewarding and fun.

Question 2

@APott -

1) Virtual machines don't create binaries. The Java compiler creates binary .class files; a running JVM loads and executes class files.

2) There's nothing particularly "new" or unique about the JVM. Conceptually, it's not dissimilar to UCSD Pascal or IBM VM/370. Here's a good short history of VMs:

http://cap-lore.com/Software/CP.html

3) If you're interested, the complete JVM specification is online, and there are many books and links that discuss it in detail.
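To illustrate point 1, here is a small sketch of the very first check a loader makes on a class file. Per the JVM specification, a class file begins with the big-endian magic number 0xCAFEBABE, followed by a 16-bit minor version and a 16-bit major version. The function name `read_class_header` is made up for this example, and a real JVM of course validates much more than these eight bytes.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # first four bytes of every Java class file

def read_class_header(data):
    """Return (minor_version, major_version) from the start of a class file."""
    if len(data) < 8:
        raise ValueError("too short to be a Java class file")
    # Class files are big-endian: u4 magic, u2 minor_version, u2 major_version.
    magic, minor, major = struct.unpack_from(">IHH", data, 0)
    if magic != CLASS_MAGIC:
        raise ValueError("not a Java class file")
    return minor, major
```

You would call it on the bytes of a compiled class, for example `read_class_header(open("Foo.class", "rb").read())`, where `Foo.class` stands in for whatever file your Java compiler produced.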

Question 3

All that a compiler does is transform a string to a string, whether the target is a real machine or a virtual machine. Since you're building your own target VM, you might encode instructions differently than existing virtual or physical instruction sets do, but that doesn't really change anything. Every physical machine instruction set can be emulated in software, and every virtual machine instruction set can be run in hardware (though this could be slightly harder in practice, since an instruction set designed for a virtual machine can be much more complex than the hardware budget allows). The CPU, after all, is just an interpreter of an instruction set.
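The "interpreter of an instruction set" idea can be sketched as a fetch-decode-execute loop. The four opcodes and the stack-based design below are hypothetical choices for this example only; your own VM could just as well be register-based or use a completely different encoding.

```python
# Hypothetical opcodes for a toy stack machine.
PUSH, ADD, MUL, HALT = range(4)

def run(program):
    """Execute a flat list of opcodes and operands; return the top of the stack."""
    stack = []
    pc = 0  # program counter: index of the next instruction
    while True:
        opcode = program[pc]  # fetch
        pc += 1
        if opcode == PUSH:    # decode and execute
            stack.append(program[pc])  # the operand follows the opcode
            pc += 1
        elif opcode == ADD:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == MUL:
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode}")
```

Running `run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])` evaluates (2 + 3) * 4. A hardware CPU does the same fetch, decode, and execute steps, just in silicon.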

Any compiler book should expand on this, but the compilation process is the same for a physical or a virtual machine. In general, you start by parsing your source language into a source abstract syntax tree (AST). Then you need a translation that transforms this source AST into a target AST (the target language is generally much flatter than the source language, so you might not actually need a tree; an array is usually sufficient). Finally, you need code generation to transform the target AST into bytecode (this is usually just a one-to-one translation from target AST nodes to bytecode). For languages with complicated syntax, you may need an intermediate parsing stage that forms a concrete syntax tree, a.k.a. parse tree, before you can form the source AST. Some compilers use multiple translation stages and may include an optimizing translator in between. Those are minor differences.
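Here is a minimal sketch of the last two stages for a toy expression language: flattening a source AST into an array of target instructions, then encoding each instruction one-to-one as bytes. The nested-tuple AST shape, the opcode numbers, and the 5-byte instruction encoding (one opcode byte plus a 32-bit operand) are all assumptions chosen for this example.

```python
import struct

# Hypothetical opcode numbering for a toy stack machine.
OPCODES = {"PUSH": 0, "ADD": 1, "SUB": 2, "MUL": 3, "DIV": 4}
BINOPS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def flatten(node, out=None):
    """Translate a source AST into a flat, post-order list of (name, operand).

    AST nodes are ("num", value) or ("binop", op, left, right).
    """
    if out is None:
        out = []
    if node[0] == "num":
        out.append(("PUSH", node[1]))
    elif node[0] == "binop":
        _, op, left, right = node
        flatten(left, out)   # operands first...
        flatten(right, out)
        out.append((BINOPS[op], 0))  # ...then the operator (operand unused)
    else:
        raise ValueError(f"unknown node type {node[0]!r}")
    return out

def encode(instructions):
    """Code generation: each target instruction becomes exactly 5 bytes."""
    return b"".join(
        struct.pack("<Bi", OPCODES[name], operand) for name, operand in instructions
    )
```

For the tree of `1 + 2 * 3`, `flatten` produces PUSH 1, PUSH 2, PUSH 3, MUL, ADD, and `encode` turns those five instructions into 25 bytes. An optimizing pass, if you add one, would sit between the two functions and rewrite the flat list.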