Question

I mean, interpreters work on a list of instructions, which seem to be composed more or less of sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers by doing bit-wise operations, for use in a big switch statement where all the operations are located.

My specific question is: How do the object values get stored/retrieved?

For example, let's (non-realistically) assume:

  1. Our instructions are unsigned 32-bit integers.
  2. We've reserved the first 4 bits of each instruction for the opcode.

If I wanted to store data in the same integer as my opcode, I'm limited to a 28-bit value. If I wanted to store it in the next instruction, I'm limited to a 32-bit value.

Values like strings require far more storage than this. How do most interpreters get away with this in an efficient manner?


Solution

I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is a fair amount different.

For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.

So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.

For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)

That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:

  1. Optional: small immediate (N bits of data in the instruction itself)
  2. large immediate (data in the 64-bit word following the instruction)
  3. implied (operand(s) on top of stack)
  4. Absolute (address specified in 64 bits following instruction)
  5. relative (offset specified in or following instruction)

I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.

Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:

class A { 
    int x;
    int y;
};

class B { 
    int a;
    int b;
};

...and some code like:

A a {1, 2};
B b {3, 4};

a.x += b.a;

The initialization would mean the values in the executable file are loaded into the memory locations assigned to a and b. The addition could then produce code something like this:

push immediate a.x   // put &a.x on top of stack
dupe                 // copy address to next lower stack position
load                 // load value from a.x
push immediate b.a   // put &b.a on top of stack
load                 // load value from b.a
add                  // add two values
store                // store back to a.x using address placed on stack with `dupe`

Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).

The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.

Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).

OTHER TIPS

Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)

Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.

Here's an example:

static void string(Compiler* compiler, int allowAssignment)
{
  // Define a constant for the literal.
  int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
      compiler->parser->currentString, compiler->parser->currentStringLength));

  // Compile the code to load the constant.
  emit(compiler, CODE_CONSTANT);
  emit(compiler, constant);
}

At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.

Here's an example:

CASE_CODE(CONSTANT):
  PUSH(frame->fn->constants[READ_ARG()]);
  DISPATCH();

For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow