Question

What techniques promote efficient opcode dispatch to make a fast interpreter? Are there some techniques that only work well on modern hardware, and others that no longer work well because of hardware advances? What trade-offs must be made between ease of implementation, speed, and portability?

I'm pleased that Python's C implementation is finally moving beyond a simple switch (opcode) {...} implementation for opcode dispatch to indirect threading as a compile-time option, but I'm less pleased that it took them 20 years to get there. Maybe if we document these strategies on Stack Overflow, the next language will get fast faster.


Solution

There are a number of papers on different kinds of dispatch:

M. Anton Ertl and David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters, in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 03), pp. 278-288, San Diego, California, June 2003.

M. Anton Ertl and David Gregg, The behaviour of efficient virtual machine interpreters on modern architectures, in Proceedings of the 7th European Conference on Parallel Computing (Europar 2001), pp. 403-412, LNCS 2150, Manchester, August 2001.

An excellent summary is provided by Yunhe Shi in his PhD thesis.

Also, someone discovered a new technique a few years ago which is valid ANSI C.

OTHER TIPS

Before you start anything, check Lua.

It's small (around 150 KB), pure ANSI C, and works on anything that has a C compiler. It's also very fast.

And most importantly, the source code is clean and readable. It's worth checking out.

Indirect threading is a strategy where each opcode implementation has its own JMP to the next opcode. The patch to the Python interpreter looks something like this:

add:
    result = a + b;
    goto *opcode_targets[*next_instruction++];

opcode_targets maps each instruction in the language's bytecode to the memory location of that opcode's implementation. This is faster because the processor's branch predictor can make a different prediction for each bytecode, in contrast to a switch statement, which has only a single indirect branch instruction shared by every opcode.

The compiler must support computed goto for this to work, which in practice mostly means GCC (Clang also supports it).
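
To make the mechanism concrete, here is a minimal, self-contained sketch of indirect-threaded dispatch using the computed-goto extension. The opcode set, the DISPATCH macro, and the toy program are invented for this example and are not taken from CPython:

/* Indirect-threaded dispatch sketch (GCC/Clang computed goto).
   The opcodes and the program below are made up for illustration. */
#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

int main(void) {
    /* Table mapping each opcode number to the address of its handler. */
    static void *opcode_targets[] = {
        &&do_push1, &&do_add, &&do_print, &&do_halt
    };

    unsigned char program[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    unsigned char *next_instruction = program;
    int stack[16], *sp = stack;

    /* Every handler ends with its own indirect jump to the next handler,
       so each bytecode gets its own branch-predictor entry. */
    #define DISPATCH() goto *opcode_targets[*next_instruction++]

    DISPATCH();

do_push1:
    *sp++ = 1;
    DISPATCH();
do_add:
    sp--;
    sp[-1] += sp[0];
    DISPATCH();
do_print:
    printf("%d\n", sp[-1]);   /* prints 2 for the toy program above */
    DISPATCH();
do_halt:
    return 0;
}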

Direct threading is similar, but in direct threading the array of opcodes is replaced with pointers to the opcode implementations, like so:

goto *next_opcode_target++;
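
As a sketch of the difference (again with invented names), a translation step stores the handler addresses directly in the instruction stream, so dispatch no longer needs the opcode-to-address table lookup:

/* Direct-threaded dispatch sketch (GCC/Clang computed goto).
   The "compiled" program holds handler addresses instead of opcodes. */
#include <stdio.h>

int main(void) {
    void *program[] = { &&do_push1, &&do_push1, &&do_add, &&do_print, &&do_halt };
    void **next_opcode_target = program;
    int stack[16], *sp = stack;

    goto *(*next_opcode_target++);

do_push1:
    *sp++ = 1;
    goto *(*next_opcode_target++);
do_add:
    sp--;
    sp[-1] += sp[0];
    goto *(*next_opcode_target++);
do_print:
    printf("%d\n", sp[-1]);   /* prints 2 */
    goto *(*next_opcode_target++);
do_halt:
    return 0;
}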

These techniques are only useful because modern processors are pipelined and must clear their pipelines (slow) on a mispredicted branch. The processor designers put in branch prediction to avoid having to clear the pipeline as often, but branch prediction only works for branches that are more likely to take a particular path.

One big win is to store the source code in an intermediate form, rather than redoing lexical analysis and parsing during execution.

This can range all the way from just storing the tokens, through Forth-style threaded code, and on to JIT compilation.
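
As a trivial sketch of that idea (the compile/execute split and the toy "+"-only language are invented for illustration), the source is parsed once into a compact array, and execution afterwards only ever touches that array:

/* Parse "2+3+4" once into an array of integers (a toy intermediate
   form), then evaluate the array repeatedly without re-reading the
   source text.  Everything here is invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

/* One-time front end: turn the source text into operand values. */
static int compile(const char *src, int *code, int max) {
    int n = 0;
    char *end;
    while (*src && n < max) {
        code[n++] = (int)strtol(src, &end, 10);
        src = end;
        if (*src == '+') src++;   /* skip the operator */
    }
    return n;
}

/* Hot path: runs over the precompiled form only. */
static int execute(const int *code, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += code[i];
    return sum;
}

int main(void) {
    int code[16];
    int n = compile("2+3+4", code, 16);        /* parse once            */
    for (int i = 0; i < 3; i++)                /* ...execute many times */
        printf("%d\n", execute(code, n));      /* prints 9 each time    */
    return 0;
}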

Benchmarking is a good technique for making anything fast on a given platform (or platforms). Test, refine, test again, improve.

I don't think you can get a better answer than those. There are lots of techniques for building interpreters. But here's a tip: don't make trade-offs; just choose what you really need and pursue those goals.

I found a blog post on threaded interpreter implementation that was useful.

The author describes GCC label-based threading and also how to do this in Visual Studio using inline assembler.

http://abepralle.wordpress.com/2009/01/25/how-not-to-make-a-virtual-machine-label-based-threading/

The results are interesting. He reports a 33% performance improvement when using GCC, but surprisingly the Visual Studio inline-assembly implementation is 3 times slower!

The question is a bit vague, but it seems you're asking about writing an interpreter.

Interpreters typically utilize traditional parsing components: lexer, parser, and abstract syntax tree (AST). This allows the designer to read and interpret valid syntax and build a tree structure of commands with associated operators, parameters, etc.

Once the input has been tokenized, parsed, and put into AST form, the interpreter can begin executing by traversing the tree.
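
As a concrete (and entirely invented) illustration of that last step, here is a tiny tree-walking evaluator over a hand-built AST standing in for parser output:

/* Minimal tree-walking evaluator sketch.  The Node type, the NUM/ADD/MUL
   tags, and the hand-built tree are invented for illustration. */
#include <stdio.h>

typedef enum { NUM, ADD, MUL } Kind;

typedef struct Node {
    Kind kind;
    int value;                    /* used when kind == NUM  */
    struct Node *left, *right;    /* used for ADD and MUL   */
} Node;

/* Execute by recursively walking the tree the parser produced. */
static int eval(const Node *n) {
    switch (n->kind) {
    case NUM: return n->value;
    case ADD: return eval(n->left) + eval(n->right);
    case MUL: return eval(n->left) * eval(n->right);
    }
    return 0;
}

int main(void) {
    /* Hand-built AST for (1 + 2) * 3, standing in for parser output. */
    Node one   = { NUM, 1, NULL, NULL };
    Node two   = { NUM, 2, NULL, NULL };
    Node three = { NUM, 3, NULL, NULL };
    Node sum   = { ADD, 0, &one,  &two   };
    Node prod  = { MUL, 0, &sum,  &three };

    printf("%d\n", eval(&prod));   /* prints 9 */
    return 0;
}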

There are many options, but I've recently used ANTLR as a parser generator that can build parsers in various target languages, including C/C++ and C#.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow