Question

I'm trying to get a good understanding of branch prediction by measuring the time to run loops with predictable branches vs. loops with random branches.

So I wrote a program that takes large arrays of 0's and 1's arranged in different orders (i.e. all 0's, repeating 0-1, all random), and iterates through the array, branching based on whether the current element is 0 or 1 and doing time-wasting work on each branch.

I expected that harder-to-guess arrays would take longer to run on, since the branch predictor would guess wrong more often, and that the time-delta between runs on two sets of arrays would remain the same regardless of the amount of time-wasting work.

However, as amount of time-wasting work increased, the difference in time-to-run between arrays increased, A LOT.

This graph makes no sense to me:

[Graph: X-axis is amount of time-wasting work, Y-axis is time-to-run; the gap between the array types grows with the amount of work]

Does anyone understand this behavior? The code I'm running is below:

#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
static const int s_iArrayLen = 999999;
static const int s_iMaxPipelineLen = 60;
static const int s_iNumTrials = 10;

int doWorkAndReturnMicrosecondsElapsed(int* vals, int pipelineLen){
        int* zeroNums = new int[pipelineLen];
        int* oneNums = new int[pipelineLen];
        for(int i = 0; i < pipelineLen; ++i)
                zeroNums[i] = oneNums[i] = 0;

        chrono::time_point<chrono::system_clock> start, end;
        start = chrono::system_clock::now();
        for(int i = 0; i < s_iArrayLen; ++i){
                if(vals[i] == 0){
                        for(int i = 0; i < pipelineLen; ++i)
                                ++zeroNums[i];
                }
                else{
                        for(int i = 0; i < pipelineLen; ++i)
                                ++oneNums[i];
                }
        }
        end = chrono::system_clock::now();
        int elapsedMicroseconds = (int)chrono::duration_cast<chrono::microseconds>(end-start).count();

        //This should never fire, it just exists to guarantee the compiler doesn't compile out our zeroNums/oneNums
        for(int i = 0; i < pipelineLen - 1; ++i)
                if(zeroNums[i] != zeroNums[i+1] || oneNums[i] != oneNums[i+1]){
                        delete[] zeroNums;
                        delete[] oneNums;
                        return -1;
                }
        delete[] zeroNums;
        delete[] oneNums;
        return elapsedMicroseconds;
}

struct TestMethod{
        string name;
        void (*func)(int, int&);
        int* results;

        TestMethod(string _name, void (*_func)(int, int&)) { name = _name; func = _func; results = new int[s_iMaxPipelineLen]; }
};

int main(){
        srand( (unsigned int)time(nullptr) );

        vector<TestMethod> testMethods;
        testMethods.push_back(TestMethod("all-zero", [](int index, int& out) { out = 0; } ));
        testMethods.push_back(TestMethod("repeat-0-1", [](int index, int& out) { out = index % 2; } ));
        testMethods.push_back(TestMethod("repeat-0-0-0-1", [](int index, int& out) { out = (index % 4 == 0) ? 0 : 1; } ));
        testMethods.push_back(TestMethod("rand", [](int index, int& out) { out = rand() % 2; } ));

        int* vals = new int[s_iArrayLen];

        for(int currentPipelineLen = 0; currentPipelineLen < s_iMaxPipelineLen; ++currentPipelineLen){
                for(int currentMethod = 0; currentMethod < (int)testMethods.size(); ++currentMethod){
                        int resultsSum = 0;
                        for(int trialNum = 0; trialNum < s_iNumTrials; ++trialNum){
                                //Generate a new array...
                                for(int i = 0; i < s_iArrayLen; ++i)  
                                        testMethods[currentMethod].func(i, vals[i]);

                                //And record how long it takes
                                resultsSum += doWorkAndReturnMicrosecondsElapsed(vals, currentPipelineLen);
                        }

                        testMethods[currentMethod].results[currentPipelineLen] = (resultsSum / s_iNumTrials);
                }
        }

        cout << "\t";
        for(int i = 0; i < s_iMaxPipelineLen; ++i){
                cout << i << "\t";
        }
        cout << "\n";
        for (int i = 0; i < (int)testMethods.size(); ++i){
                cout << testMethods[i].name.c_str() << "\t";
                for(int j = 0; j < s_iMaxPipelineLen; ++j){
                        cout << testMethods[i].results[j] << "\t";
                }
                cout << "\n";
        }
        int end;
        cin >> end;
        delete[] vals;
}

Pastebin link: http://pastebin.com/F0JAu3uw


Solution

I think you may be measuring cache/memory performance more than branch prediction. Your inner 'work' loop is accessing an ever-increasing chunk of memory, which may explain the linear growth, the periodic behaviour, etc.

I could be wrong, as I've not tried replicating your results, but if I were you I'd factor out memory accesses before timing other things. Perhaps sum one volatile variable into another, rather than working in an array.

Note also that, depending on the CPU, the branch prediction can be a lot smarter than just recording the last time a branch was taken - repeating patterns, for example, aren't as bad as random data.

Ok, a quick and dirty test I knocked up on my tea break which tried to mirror your own test method, but without thrashing the cache, looks like this:

[Graph: timing results from the answerer's test, without cache thrashing]

Is that more what you expected?

If I can spare any time later there's something else I want to try, as I've not really looked at what the compiler is doing...

Edit:

And, here's my final test - I recoded it in assembler to remove the loop branching, ensure an exact number of instructions in each path, etc.

[Graph: more branch prediction results, from the assembler version of the test]

I also added an extra case, of a 5-bit repeating pattern. It seems pretty hard to upset the branch predictor on my ageing Xeon.

Other tips

In addition to what JasonD pointed out, I would also like to note that there are conditionals inside the for loop, which may affect branch prediction:

if(vals[i] == 0)
{
    for(int i = 0; i < pipelineLen; ++i)
        ++zeroNums[i];
}

i < pipelineLen; is a condition just like your ifs. The compiler could unroll this loop, but since pipelineLen is an argument passed to the function, it probably does not.

I'm not sure if this can explain wavy pattern of your results, but:

Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an example of loop unrolling:

Read full article: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
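As a small sketch of that advice (this is not the Intel article's example; the function name is invented), unrolling the questioner's inner work loop by four cuts the number of loop-back branches to roughly a quarter:

```cpp
// Hypothetical 4x unroll of the inner work loop: four increments per
// iteration means one loop-back branch per four elements, so the loop
// conditional consumes fewer branch-predictor entries.
void incrementUnrolled(int* nums, int pipelineLen) {
    int k = 0;
    // Unrolled main body: handles groups of four.
    for (; k + 4 <= pipelineLen; k += 4) {
        ++nums[k]; ++nums[k + 1]; ++nums[k + 2]; ++nums[k + 3];
    }
    // Remainder loop for the last pipelineLen % 4 elements.
    for (; k < pipelineLen; ++k)
        ++nums[k];
}
```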

So your loops are not only measuring memory throughput; they are also affecting the BTB.

If you passed a 0-1 pattern in your list but then executed the for loop with pipelineLen = 2, your BTB will be filled with something like 0-1-1-0 - 1-1-1-0 - 0-1-1-0 - 1-1-1-0, and then it will start to overlap, so this can indeed explain the wavy pattern of your results (some overlaps will be more harmful than others).

Take this as an example of what may happen rather than literal explanation. Your CPU may have much more sophisticated branch prediction architecture.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow