Question

From http://www.boost.org/community/implementation_variations.html

"... coding differences such as changing a class from virtual to non-virtual members or removing a level of indirection are unlikely to make any measurable difference unless deep in an inner loop. And even in an inner loop, modern CPUs often execute such competing code sequences in the same number of clock cycles!"

I am trying to understand the "even in the inner loop" part. Specifically what mechanisms do CPUs implement to execute the two codes (virtual vs non-virtual or an additional level of indirection) within the same number of clock cycles? I know about instruction pipelining and caching, but how is it possible to perform a virtual call within the same number of clock cycles as a non-virtual call? How is the indirection "lost"?
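For concreteness, here is a minimal sketch of the two kinds of call I am comparing (my own example, with made-up names; a compiler may of course devirtualize the first call if it can prove the dynamic type):

#include <cstdio>

struct WithVirtual {
    virtual void step() { std::puts("virtual call"); }   // dispatched through the vtable at runtime
    virtual ~WithVirtual() = default;
};

struct WithoutVirtual {
    void step() { std::puts("direct call"); }            // target known at compile time
};

void inner_loop(WithVirtual* v, WithoutVirtual* n) {
    for (int i = 0; i < 8; ++i) {
        v->step();   // indirect call: load vtable pointer, load slot, branch
        n->step();   // direct call: fixed destination encoded in the instruction
    }
}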


Solution

Caching (e.g. branch target caching), parallel load units (part of pipelining, but also things like "hit under miss" which don't stall the pipeline), and out-of-order execution are likely to help transform a load-load-branch into something that is closer to a fixed branch. Instruction folding/elimination (what's the proper term for this?) in the decode or branch prediction stage of the pipeline may also contribute.

All of this relies on a lot of different things, though: how many different branch targets there are (e.g. how many different virtual overrides are you likely to trigger), how many things you loop over (is the branch target cache "warm"? how about the icache/dcache?), how the virtual tables or indirection tables are laid out in memory (are they cache-friendly, or is each new vtable load possibly evicting an old vtable?), is the cache being invalidated repeatedly due to multicore ping-ponging, etc.
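To make the "how many branch targets" point concrete (a sketch of my own, with made-up types), it comes down to what a single call site sees over the course of the loop:

#include <vector>

struct Shape  { virtual float area() const = 0; virtual ~Shape() = default; };
struct Circle : Shape { float r = 1.f; float area() const override { return 3.14159f * r * r; } };
struct Square : Shape { float s = 1.f; float area() const override { return s * s; } };

float total_area(const std::vector<Shape*>& shapes) {
    float sum = 0.f;
    for (const Shape* s : shapes)
        sum += s->area();   // one indirect call site: if every element is a Circle, the target
                            // (and the vtable being loaded) is the same on every iteration; if
                            // Circles and Squares alternate unpredictably, the same call site
                            // keeps switching targets and the caches have a much harder time
    return sum;
}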

(Disclaimer: I'm definitely not an expert here, and a lot of my knowledge comes from studying in-order embedded processors, so some of this is extrapolation. If you have corrections, feel free to comment!)

The correct way to determine if it's going to be a problem for a specific program is of course to profile. If you can, do so with the help of hardware counters -- they can tell you a lot about what's going on in the various stages of the pipeline.


Edit:

As Hans Passant points out in a comment on the question (Modern CPU Inner Loop Indirection Optimizations), the key to getting these two things to take the same amount of time is the ability to effectively "retire" more than one instruction per cycle. Instruction elimination can help with this, but superscalar design is probably more important (hit under miss is a very small and specific example; fully redundant load units might be a better one).

Let's take an ideal situation, and assume a direct branch is just one instruction:

branch dest

...and an indirect branch is three (maybe you can get it in two, but it's greater than one):

load vtable from this
load dest from vtable
branch dest
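
In C++ terms, that load-load-branch is roughly what a vtable-based virtual call decomposes into. The following sketch spells it out by hand; it assumes the Itanium-style ABI used by gcc/clang, is not portable, and is only meant to make the two loads visible (the compiler normally emits this for you):

struct Base {
    virtual int f() { return 1; }   // occupies the first vtable slot here
    virtual ~Base() = default;
};

int call_f_by_hand(Base* obj) {
    using Fn = int (*)(Base*);                           // 'this' passed as an ordinary first argument
    void** vtable = *reinterpret_cast<void***>(obj);     // load 1: vtable pointer stored in the object
    Fn dest = reinterpret_cast<Fn>(vtable[0]);           // load 2: target address from the vtable slot
    return dest(obj);                                    // indirect branch (call) to the loaded target
}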

Let's assume an absolutely perfect situation: *this and the entire vtable are in the L1 cache, and the L1 cache is fast enough to sustain an amortized cost of one cycle per instruction for the two loads. (You can even assume the processor reordered the loads and intermixed them with earlier instructions to allow time for them to complete before the branch; it doesn't matter for this example.) Also assume the branch target cache is hot, that there's no pipeline flush cost for the branch, and that the branch instruction comes down to a single cycle (amortized).

The theoretical minimum time for the first example is therefore 1 cycle (amortized).

The theoretical minimum for the second example, absent instruction elimination or redundant functional units or something that will allow retiring more than one instruction per cycle, is 3 cycles (there are 3 instructions)!

The indirect call will always be slower, because there are more instructions, until you add something like a superscalar design that allows retiring more than one instruction per cycle.

Once you have this, the minimum for both examples becomes something between 0 and 1 cycles, again, provided everything else is ideal. Arguably you have to have more ideal circumstances for the second example to actually reach that theoretical minimum than for the first example, but it's now possible.

In some of the cases you'd care about, you're probably not going to reach that minimum for either example. Either the branch target cache will be cold, or the vtable won't be in the data cache, or the machine won't be capable of reordering the instructions to take full advantage of the redundant functional units.

...this is where profiling comes in, which is generally a good idea anyway.

Alternatively, you can espouse a slight paranoia about virtuals in the first place. See Noel Llopis's article on data-oriented design, the excellent Pitfalls of Object-Oriented Programming slides, and Mike Acton's grumpy-yet-educational presentations. Suddenly you've moved into patterns that the CPU is already likely to be happy with, if you're processing a lot of data.
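
As a rough illustration of the restructuring those references advocate (a sketch of my own, not taken from any of them): instead of a heterogeneous container dispatched through virtual calls, keep the hot data in homogeneous, contiguous arrays and update each with a plain loop.

#include <vector>

// Dispatch-heavy style: one mixed container, one indirect call per element.
struct Entity { virtual void update(float dt) = 0; virtual ~Entity() = default; };
void update_all(std::vector<Entity*>& es, float dt) {
    for (Entity* e : es) e->update(dt);        // indirect call + pointer chase per element
}

// Data-oriented style: plain structs in a contiguous array, no indirection in the hot loop.
struct Particle { float x = 0, y = 0, vx = 0, vy = 0; };
void update_particles(std::vector<Particle>& ps, float dt) {
    for (Particle& p : ps) {                   // predictable branch, sequential cache-friendly access
        p.x += p.vx * dt;
        p.y += p.vy * dt;
    }
}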

High-level language features like virtual are usually a tradeoff between expressiveness and control. I honestly think, though, that just by increasing your awareness of what virtual is actually doing (don't be afraid to read the disassembly view from time to time, and definitely peek at your CPU's architecture manuals), you'll tend to use it when it makes sense and not when it doesn't, and a profiler can cover the rest if needed.

One-size-fits-all statements about "don't use virtual" or "virtual use is unlikely to make a measurable difference" make me grouchy. The reality is usually more complicated, and either you're going to be in a situation where you care enough to profile or avoid it, or you're in that other 95% where it's probably not worth caring except for the possible educational content.

OTHER TIPS

Pipelining is the main way.

It might take 20 clock cycles to load an instruction, decode it, perform its actions and load indirect memory references. But thanks to the pipeline the processor can be executing parts of 19 other instructions at the same time in different stages of the pipeline, giving an overall throughput of one instruction every clock cycle, regardless of how long it actually takes to feed that single instruction through the pipeline.

What happens, I think, is that the processor has a special cache which holds the locations and targets of branches and indirect jumps. If an indirect jump is encountered at $12345678, and the last time it was encountered it went to address $12348765, the processor can start speculative execution of the instructions at address $12348765 even before it resolves the address of the branch. In many cases, within the inner loop of a function, a particular indirect jump will always jump to the same address throughout the duration of the loop. The indirect-jump cache can thus avoid branching penalties.
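
A toy model of that idea (purely illustrative; real hardware is nothing like a hash map, but the bookkeeping is the same in spirit): remember the last destination seen for each indirect jump, predict it again next time, and pay a flush only when the prediction turns out wrong.

#include <cstdint>
#include <unordered_map>

struct ToyIndirectJumpCache {
    std::unordered_map<std::uint64_t, std::uint64_t> last_target;  // jump address -> last destination seen

    // Prediction used to start speculative execution; 0 means "no prediction yet".
    std::uint64_t predict(std::uint64_t jump_addr) const {
        auto it = last_target.find(jump_addr);
        return it == last_target.end() ? 0 : it->second;
    }

    // Called once the real destination is known. A correct prediction means the
    // speculatively executed work is kept; a miss means a pipeline flush.
    bool resolve(std::uint64_t jump_addr, std::uint64_t actual_dest) {
        const bool predicted_correctly = predict(jump_addr) == actual_dest;
        last_target[jump_addr] = actual_dest;
        return predicted_correctly;
    }
};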

Modern CPUs use an adaptive branch prediction technique which can predict many indirect jumps such as you get with a vtable implementation of virtual functions. See http://en.wikipedia.org/wiki/Branch_prediction#Prediction_of_indirect_jumps

If the CPU already has the memory address in cache, then executing a load instruction is trivial, if it costs anything at all.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow