Question

I want to work on a very simple compiler for a very simple language. The compiler will compile the code to some basic bytecode. Later, a virtual machine program will run the bytecode, and execute the program.

I have several questions about what bytecode actually is:

  • Does bytecode have to be of binary form, aka composed of only 1 and 0 sequences? Or can it include more kinds of numbers, and even words?

  • How exactly, most commonly, is the bytecode of a program composed? One idea I had for what the bytecode will contain and how it will be composed, using my compiler, is: The compiler will scan the developer's code, and convert each command to some bytecode equivalent. For example, the command print will be converted to 1 in the bytecode. Later, when the VM runs the bytecode, each element in the bytecode will be converted to some execution in the program. For example, if the VM scans the bytecode and comes acorss 1, it will print something. Is this approach common? Is it valid? What is the most common approach to how and what the bytecode composes?

Was it helpful?

Solution

  • there is no need to go binary, you can write out text and interpret that directly. Most VMs will use octet-based binary for efficiency.

  • that is entirely up to the designer. however usually the common instructions (adding 2 numbers, GOTOs,...) will be shorter than the uncommon ones (calling a function, raising an exception,...)

    one way to do this is to let the first few bits of an opcode decide the family (arithmatic, comparison, jumps,...) of the instruction, the next few bits will be the specific instruction, and the last bits will indicate the operands used (if applicable)

OTHER TIPS

What you are doing, effectively, is designing new computer instructions, along with a new computer (your virtual machine) implemented as a program.

Generally, instructions (which is what byte-codes are) have an initial part that the interpreter examines to work out what is coming next. For example, as you have suggested, 1 could mean print the following string. 2 might mean add the following 2 integers.

As you work out what instructions you need, you will find that you need a way of holding data (called the memory organization), and a way of organizing the instructions. Will the instructions do very complex things, or very simple things? If they do complex things (print this string) then many will be needed, but the data can be organized in a simpler way. If the instructions do simple things (load this integer to work area 1) there can be fewer instructions, but the data organization will need to support richer structures and more data types.

Start simple, say with 1-byte op-codes. Expect to go through many code versions as you work it out.

Enjoy. You will learn more about how computers work than you imagine.

For a simple example let's take a look at PHP's Zend Engine. This is a relatively simple VM working purely in memory, thus not having to are about serialisation etc.

In PHP the bytecode consists of 167 different opcodes which can be found in one header: http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_vm_opcodes.h

All those opcodes are stored in a data structure along with some meta information in form of an array (represented as pointer to the first element)

struct _zend_op_array {
    /* ... */
    zend_op *opcodes;
    /* ... */
 }

Each operation then has its own data structure stored in there:

struct _zend_op {
    opcode_handler_t handler;
    znode_op op1;
    znode_op op2;
    znode_op result;
    ulong extended_value;
    uint lineno;
    zend_uchar opcode;
    zend_uchar op1_type;
    zend_uchar op2_type;
    zend_uchar result_type;
};

In there we can see multiple things:

  • The opcode number from the long list of opcodes is stored in the opcode field.
  • Each operation takes two operands (op1 and op2) and has a return value (result)
  • Operands and return value have types, these indicate whether they represent variables, constants or temporary values. In PHP's implementation these are stored in different places and are accessed differently.
  • There is an extended_value which is used i.e. in foreach loops to store some more info
  • The line number from the source file for reporting errors (filename is stored in the zend_op_array strucuture above)
  • The handler is a small optimization, directly pointing to the function implementing this.

While compiling the script the compiler will create these data structures and ensure that the operands and return value show to the same place, so with a call like foo(bar()) the return value of the call to bar will be the same temporary as the operand of the opcode pushing an argument on the functions argument stack. Wait what - argument stack!? you might ask, right: As each opcode only accepts two operands we can't directly pass the function parameters, but first ZEND_SEND_* operations fill a stack, then a ZEND_DO_FCALL/ZEND_DO_FCALL_BY_NAME operation will be executed using that stack.

Using the tool vld we can dump this compiled form:

php -dextension=modules/vld.so -dvld.active -r 'foo(bar());'
Finding entry points
Branch analysis from position: 0
Return found
filename:       Command line code
function name:  (null)
number of ops:  6
compiled vars:  none
line     # *  op                           fetch          ext  return  operands
---------------------------------------------------------------------------------
   1     0  >   INIT_FCALL_BY_NAME                                       'foo', 'foo'
         1      INIT_FCALL_BY_NAME                                       'bar', 'bar'
         2      DO_FCALL_BY_NAME                              0          
         3      SEND_VAR_NO_REF                               4          $0
         4      DO_FCALL_BY_NAME                              1          
         5    > RETURN                                                   null

branch: #  0; line:     1-    1; sop:     0; eop:     5
path #1: 0, 

Next step is execution. This essentially is this loop (shortened):

    while (1) {
            if ((ret = OPLINE->handler(execute_data TSRMLS_CC)) > 0) {
                    switch (ret) {
                            case 1:
                                    return;
                                    /* ... */
                            default:
                                    break;
                    }
            }

    }

OPLINE represents the current element of our array of zend_op structures. Each op has a handler which is executed. (for seeing the correct loop: it is in zend_vm_execute.h generated by zend_vm_gen.php out of zend_vm_execute.skl and zend_vm_def.h) If the handler returns one the script/function/... we are currently executing returns, such the loop ends.

A simple opcode handler can be defined like this (see zend_vm_def.h):

ZEND_VM_HANDLER(1, ZEND_ADD, CONST|TMP|VAR|CV, CONST|TMP|VAR|CV)
{
        USE_OPLINE
        zend_free_op free_op1, free_op2;

        SAVE_OPLINE();
        fast_add_function(&EX_T(opline->result.var).tmp_var,
                GET_OP1_ZVAL_PTR(BP_VAR_R),
                GET_OP2_ZVAL_PTR(BP_VAR_R) TSRMLS_CC);
        FREE_OP1();
        FREE_OP2();
        CHECK_EXCEPTION();
        ZEND_VM_NEXT_OPCODE();
}

This code looks similar to C but is filled into the gen script mentioned ahead. We see that this is the opcode number 1, going by the name of ZEND_ADD which represents an addition. It requires both operands, which both can be of any kind (constants, temporaries, arbitrary variables, compiler cached variables). At the end of the operation the pointer is set to the next opcode in the array.

Virtual machine bytecode is simply just another instruction set. Where we see cases of these instruction sets is generally for portability. You basically only need one compiler or at least one backend.

The portability comes from choosing a new instruction set which is not too hard to implement on the target real instruction sets. A stack based instruction set is usually the answer here for portability but not always what is chosen.

As with any instruction set, new or old, you can choose to represent your instructions in any form basically ascii or binary, both can have many various implementations and result in the same machine code being executed.

If you plan for this instruction set to always be VM based and in particular to sit on top of an operating system, you can certainly if you choose, to implement higher level functions like print or file i/o into the instruction set. Something you would not necessarily do when targeting a real machine code instruction set implemented in hardware, since the hardware itself rarely has such complicated functions in the instruction set. system calls to other piles of code yes, but not direct implementation. The virtual instruction set as far as that instruction set does has whatever you define as the level of functionality of an individual instruction/operation. if you dont have these high level print, file i/o, etc functions in the virtual instruction set then, you will have to implement some other solution like a virtual address space with virtual devices that have to be implemented as the virtual instruction set wont want to know about real hardware details, that defeats the portability and the advantage of one compiler for all platforms.

Licensed under: CC-BY-SA with attribution
scroll top