Question

I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.

So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all random), and iterates through the array, branching based on whether the current element is 0 or 1 and doing time-wasting work on each branch.

I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.

However, as amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.

This graph makes no sense to me:

[Graph: X-axis is amount of time-wasting work, Y-axis is time-to-run; the gap between the array types grows with the amount of work]

Does anyone understand this behavior? The code I'm running is below:

#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;

int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
        int* zeroNums = new int[pipelineLen];
        int* oneNums = new int[pipelineLen];
        for(int i = 0; i < pipelineLen; ++i)
                zeroNums[i] = oneNums[i] = 0;

        chrono::time_point<chrono::system_clock> start, end;
        start = chrono::system_clock::now();
        for(int i = 0; i < s_iArrayLen; ++i){
                if(vals[i] == 0){
                        for(int i = 0; i < pipelineLen; ++i)
                                ++zeroNums[i];
                }
                else{
                        for(int i = 0; i < pipelineLen; ++i)
                                ++oneNums[i];
                }
        }
        end = chrono::system_clock::now();
        int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();

        //This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
        for(int i = 0; i < pipelineLen - 1; ++i)
                if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1]){
                        delete[] zeroNums;
                        delete[] oneNums;
                        return -1;
                }
        delete[] zeroNums;
        delete[] oneNums;
        return elapsedMicroseconds;
}

struct TestMethod{
        string name;
        void (*func)(int, int&);
        int* results;

        TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};

int main(){
        srand( (unsigned int)time(nullptr) );

        vector<TestMethod> testMethods;
        testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
        testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
        testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
        testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));

        int* vals = new int[s_iArrayLen];

        for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
                for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
                        int resultsSum = 0;
                        for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
                                //Generate a new array...
                                for(int i = 0; i < s_iArrayLen; ++i)  
                                        testMethods[currentMethod].func(i, vals[i]);

                                //And record how long it takes
                                resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
                        }

                        testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
                }
        }

        cout << "\t";
        for(int i = 0; i < s_iMaxPipelineLen; ++i){
                cout << i << "\t";
        }
        cout << "\n";
        for (int i = 0; i < (int)testMethods.size(); ++i){
                cout << testMethods[i].name.c_str() << "\t";
                for(int j = 0; j < s_iMaxPipelineLen; ++j){
                        cout << testMethods[i].results[j] << "\t";
                }
                cout << "\n";
        }
        int end;
        cin >> end;
        delete[] vals;
}

Pastebin link: http://pastebin.com/F0JAu3uw


Solution

I think you may be measuring cache/memory performance more than branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, etc.

I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.

Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.

Ok, a quick and dirty test I knocked up on my tea break which tried to mirror your own test method, but without thrashing the cache, looks like this:

[Graph: timing results from the answerer's test, without cache thrashing]

Is that more what you expected?

If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...

Edit:

And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.

[Graph: more branch prediction results, from the assembler version of the test]

I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.

Other tips

In addition to what JasonD pointed out, I would also like to note that there are conditionals inside the for loop, which may affect branch prediction:

if(vals[i] == 0)
{
    for(int i = 0; i < pipelineLen; ++i)
        ++zeroNums[i];
}

i < pipelineLen; is a condition just like your ifs. The compiler could unroll this loop, but since pipelineLen is an argument passed to the function, it probably does not.

I'm not sure if this can explain wavy pattern of your results, but:

Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:

Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
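As a small sketch of that advice (this is not the Intel article's example; the function name is invented), unrolling the questioner's inner work loop by four cuts the number of loop-back branches to roughly a quarter:

```cpp
// Hypothetical 4x unroll of the inner work loop: four increments per
// iteration means one loop-back branch per four elements, so the loop
// conditional consumes fewer branch-predictor entries.
void incrementUnrolled(int* nums, int pipelineLen) {
    int k = 0;
    // Unrolled main body: handles groups of four.
    for (; k + 4 <= pipelineLen; k += 4) {
        ++nums[k]; ++nums[k + 1]; ++nums[k + 2]; ++nums[k + 3];
    }
    // Remainder loop for the last pipelineLen % 4 elements.
    for (; k < pipelineLen; ++k)
        ++nums[k];
}
```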

So your loops are not only measuring memory throughput; they are also affecting the BTB.

If you passed a 0-1 pattern in your list but then executed the for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap, so this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).

Take this as an example of what may happen rather than literal explanation. Your CPU may have much more sophisticated branch prediction architecture.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow