How to deal with branch prediction when using a switch case in CPU emulation

https://stackoverflow.com//questions/11668090

12-12-2019
|

Question

I recently read the question here Why is it faster to process a sorted array than an unsorted array? and found the answer to be absolutely fascinating and it has completely changed my outlook on programming when dealing with branches that are based on Data.

I currently have a fairly basic, but fully functioning interpreted Intel 8080 Emulator written in C, the heart of the operation is a 256 long switch-case table for handling each opcode. My initial thought was this would obviously be the fastest method of working as opcode encoding isn't consistent throughout the 8080 instruction set and decoding would add a lot of complexity, inconsistency and one-off cases. A switch-case table full of pre-processor macros is a very neat and easy to maintain.

Unfortunately, after reading the aforementioned post it occurred to me that there's absolutely no way the branch predictor in my computer can predict the jumping for the switch case. Thus every time the switch-case is navigated the pipeline would have to be completely wiped, resulting in a several cycle delay in what should otherwise be an incredibly quick program (There's not even so much as multiplication in my code).

I'm sure most of you are thinking "Oh, the solution here is simple, move to dynamic recompilation". Yes, this does seem like it would cut out the majority of the switch-case and increase speed considerably. Unfortunately my primary interest is emulating older 8-bit and 16-bit era consoles (the intel 8080 here is only an example as it's my simplest piece of emulated code) where cycle and timing keeping to the exact instruction is important as the Video and Sound must be processed based on these exact timings.

When dealing with this level of accuracy performance becomes an issue, even for older consoles (Look at bSnes for example). Is there any recourse or is this simply a matter-of-fact when dealing with processors with long pipelines?

Solution

On the contrary, switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking), and a single jump. The ifs shouldn't cause a problem with branch prediction because it is unlikely you will have a bad op-code. The jump is not so friendly with the pipeline, but in the end, it's only one for the whole switch statement..

I don't believe you can convert a long switch statement of op-codes into any other form that would result in better performance. This is of course, if your compiler is smart enough to convert it to a jump table. If not, you can do so manually.

If in doubt, implement other methods and measure performance.

Edit

First of all, make sure you don't confuse branch prediction and branch target prediction.

Branch prediction solely works on branch statements. It decides whether a branch condition would fail or succeed. They have nothing to do with the jump statement.

Branch target prediction on the other hand tries to guess where the jump will end up in.

So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".

In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula that covers all your operations, like those made in logic circuits. However, with an instruction set as big as a CPU's, even if it were RISK, the cost of that computation is much higher than the penalty of a single jump.

OTHER TIPS

As the branches on your 256-way switch statement are densely packed the compiler will implement this as a jump table, so you're correct in that you'll trigger a single branch mispredict every time you pass through this code (as the indirect jump won't display any kind of predictable behaviour). The penalty associated with this will be around 15 clock cycles on a modern CPU (Sandy Bridge), or maybe up to 25 on older microarchitectures that lack a micro-op cache. A good reference for this sort of thing is "Software optimisation resources" on agner.org. Page 43 in "Optimizing software in C++" is a good place to start.

http://www.agner.org/optimize/?e=0,34

The only way you could avoid this penalty is by ensuring that the same instructions are execution regardless of the value of the opcode. This can often be done by using conditional moves (which add a data dependency so are slower than a predictable branch) or otherwise looking for symmetry in your code paths. Considering what you're trying to do this is probably not going to be possible, and if it was then it would almost certainly add a overhead greater than the 15-25 clock cycles for the mispredict.

In summary, on a modern architecture there's not much you can do that'll be more efficient than a switch/case, and the cost of mispredicting a branch isn't as much as you might expect.

The indirect jump is probably the best thing to do for instruction decoding.

On older machines, like say the Intel P6 from 1997, the indirect jump would probably get a branch misprediction.

On modern machines, like say Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the branch misprediction.

But even on the older machines that do not have an indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:

Instead of generating something that looks like

    loop:
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp [reg]
    label_instruction_00h_ADD: ...
       jmp loop
    label_instruction_01h_SUB: ...
       jmp loop
    ...

generate the code as

    loop:
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp [reg]
    label_instruction_00h_ADD: ...
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp [reg]
    label_instruction_01h_SUB: ...
       load reg := next_instruction_bits // or byte or word
       load reg2 := instruction_table[reg]
       jmp [reg]
    ...

i.e. replace the jump to the top of the instruction fetch/decode/execute loop by the code at the top of the loop at each place.

It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single target, PC indexed BTB will be quite a lot better in this latter, threaded, code, than on the original with only a single copy of the indirect jump.

Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch.

Good luck and have fun!

(In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps, or the data-driven dual, navigate a tree of N-way tables, with each entry in the tree pointing to other nodes, or to a function to evaluate.

Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.

A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large - in the tool, mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M in size (24 bits).

Data driven, i.e. a tree of nodes pointing to other nodes because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions - instead of having a branch misprediction when jumping to the case per instruction, then executing common code, then switching again, and getting a second mispredict, you do the common code, with slightly different parameters (like, how many bits of the instruction stream do you consume, and where the next set of bits to branch on is (are).

I was very aggressive in mkIrecog, as I say allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16 or 18 bit switch (64K-256K entries), and all other decodes were much smaller, no bigger than 10 bits.

Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz You may be able to see the tables used, if you care. (Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)

Most of the other guys I know who do this sort of thing do things a byte at a time - i.e. an 8 bit, 256 way, branch or table lookup.)

I thought I'd add something since no one mentioned it.

Granted, the indirect jump is likely to be the best option.

However, should you go with the N-compare way, there are two things that come to my mind:

First, instead of doing N equality compares, you could do log(N) inequality compares, testing your instructions based on their numerical opcode by dichotomy (or test the number bit by bit if the value space is near to full) .This is a bit like a hashtable, you implement a static tree to find the final element.

Second, you could run an analysis on the binary code you want to execute. You could even do that per binary, before execution, and runtime-patch your emulator. This analysis would build a histogram representing the frequency of instructions, and then you would organize your tests so that the most frequent instructions get predicted correctly.

But I cant see this being faster than a medium 15 cycles penalty, unless you have 99% of MOV and you put an equality for the MOV opcode before the other tests.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow