Question

Why does the position of a function in a C++ file affect its performance? Specifically, in the example given below we have two identical functions that have different, consistent performance profiles. How does one go about investigating this and determining why the performance is so different?

The example is pretty straightforward in that we have two functions, a and b. Each is compiled with optimisation (-O3 -march=corei7-avx), run many times in a tight loop, and timed. Here is the code:

#include <cstdint>
#include <iostream>
#include <numeric>

#include <boost/timer/timer.hpp>

bool array[] = {true, false, true, false, false, true};

uint32_t __attribute__((noinline)) a() {
    asm("");
    return std::accumulate(std::begin(array), std::end(array), 0);
}

uint32_t __attribute__((noinline)) b() {
    asm("");
    return std::accumulate(std::begin(array), std::end(array), 0);
}

const size_t WARM_ITERS = 1ull << 10;
const size_t MAX_ITERS = 1ull << 30;

void test(const char* name, uint32_t (*fn)())
{
    std::cout << name << ": ";
    for (size_t i = 0; i < WARM_ITERS; i++) {
        fn();
        asm("");
    }
    boost::timer::auto_cpu_timer t;
    for (size_t i = 0; i < MAX_ITERS; i++) {
        fn();
        asm("");
    }
}

int main(int argc, char **argv)
{
    test("a", a);
    test("b", b);
    return 0;
}

Some notable features:

  • Functions a and b are identical. They perform the same accumulate operation and compile down to the same assembly instructions.
  • Each test has a warm-up period before the timing starts, to eliminate any issues from warming up caches.

When this is compiled and run we get the following output showing a is significantly slower than b:

[me@host:~/code/mystery] make && ./mystery 
g++-4.8 -c -g -O3 -Wall -Wno-unused-local-typedefs -std=c++11 -march=corei7-avx -I/usr/local/include/boost-1_54/ mystery.cpp -o mystery.o
g++-4.8  mystery.o -lboost_system-gcc48-1_54 -lboost_timer-gcc48-1_54 -o mystery
a:  7.412747s wall, 7.400000s user + 0.000000s system = 7.400000s CPU (99.8%)
b:  5.729706s wall, 5.740000s user + 0.000000s system = 5.740000s CPU (100.2%)

If we invert the two tests (i.e. call test(b) and then test(a)) a is still slower than b:

[me@host:~/code/mystery] make && ./mystery 
g++-4.8 -c -g -O3 -Wall -Wno-unused-local-typedefs -std=c++11 -march=corei7-avx -I/usr/local/include/boost-1_54/ mystery.cpp -o mystery.o
g++-4.8  mystery.o -lboost_system-gcc48-1_54 -lboost_timer-gcc48-1_54 -o mystery
b:  5.733968s wall, 5.730000s user + 0.000000s system = 5.730000s CPU (99.9%)
a:  7.414538s wall, 7.410000s user + 0.000000s system = 7.410000s CPU (99.9%)

If we now invert the location of the functions in the C++ file (move the definition of b above a) the results are inverted and a becomes faster than b!

[me@host:~/code/mystery] make && ./mystery 
g++-4.8 -c -g -O3 -Wall -Wno-unused-local-typedefs -std=c++11 -march=corei7-avx -I/usr/local/include/boost-1_54/ mystery.cpp -o mystery.o
g++-4.8  mystery.o -lboost_system-gcc48-1_54 -lboost_timer-gcc48-1_54 -o mystery
a:  5.729604s wall, 5.720000s user + 0.000000s system = 5.720000s CPU (99.8%)
b:  7.411549s wall, 7.420000s user + 0.000000s system = 7.420000s CPU (100.1%)

So essentially, whichever function is at the top of the C++ file is slower.

Some answers to questions you may have:

  • The code compiled is identical for both a and b. The disassembly has been checked. (For those interested: http://pastebin.com/2QziqRXR)
  • The code was compiled using GCC 4.8 and 4.8.1 on Ubuntu 13.04, 13.10, and 12.04.3.
  • The effect was observed on Intel Sandy Bridge i7-2600 and Intel Xeon X5482 CPUs.

Why would this be happening? What tools are available to investigate something like this?


Solution

It looks to me like it's a cache aliasing issue.

The test case is quite clever, and correctly loads everything into cache before timing it. It looks like everything fits in cache; I've verified this (in simulation) by looking at the output of valgrind's cachegrind tool, and as one would expect in such a small test case, there are no significant cache misses:

valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64  /tmp/so
==11130== Cachegrind, a cache and branch-prediction profiler
==11130== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==11130== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11130== Command: /tmp/so
==11130== 
--11130-- warning: L3 cache found, using its data for the LL simulation.
a:  6.692648s wall, 6.670000s user + 0.000000s system = 6.670000s CPU (99.7%)
b:  7.306552s wall, 7.280000s user + 0.000000s system = 7.280000s CPU (99.6%)
==11130== 
==11130== I   refs:      2,484,996,374
==11130== I1  misses:            1,843
==11130== LLi misses:            1,694
==11130== I1  miss rate:          0.00%
==11130== LLi miss rate:          0.00%
==11130== 
==11130== D   refs:        537,530,151  (470,253,428 rd   + 67,276,723 wr)
==11130== D1  misses:           14,477  (     12,433 rd   +      2,044 wr)
==11130== LLd misses:            8,336  (      6,817 rd   +      1,519 wr)
==11130== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==11130== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==11130== 
==11130== LL refs:              16,320  (     14,276 rd   +      2,044 wr)
==11130== LL misses:            10,030  (      8,511 rd   +      1,519 wr)
==11130== LL miss rate:            0.0% (        0.0%     +        0.0%  )

I picked a 32k, 8-way associative cache with a 64-byte cache line size to match common Intel CPUs, and saw the same discrepancy between the a and b functions repeatedly.

Running on an imaginary machine with a 32k, 128-way associative cache with the same cache line size, though, that difference all but goes away:

valgrind --tool=cachegrind --I1=32768,128,64 --D1=32768,128,64  /tmp/so
==11135== Cachegrind, a cache and branch-prediction profiler
==11135== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==11135== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11135== Command: /tmp/so
==11135== 
--11135-- warning: L3 cache found, using its data for the LL simulation.
a:  6.754838s wall, 6.730000s user + 0.010000s system = 6.740000s CPU (99.8%)
b:  6.827246s wall, 6.800000s user + 0.000000s system = 6.800000s CPU (99.6%)
==11135== 
==11135== I   refs:      2,484,996,642
==11135== I1  misses:            1,816
==11135== LLi misses:            1,718
==11135== I1  miss rate:          0.00%
==11135== LLi miss rate:          0.00%
==11135== 
==11135== D   refs:        537,530,207  (470,253,470 rd   + 67,276,737 wr)
==11135== D1  misses:           14,297  (     12,276 rd   +      2,021 wr)
==11135== LLd misses:            8,336  (      6,817 rd   +      1,519 wr)
==11135== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==11135== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==11135== 
==11135== LL refs:              16,113  (     14,092 rd   +      2,021 wr)
==11135== LL misses:            10,054  (      8,535 rd   +      1,519 wr)
==11135== LL miss rate:            0.0% (        0.0%     +        0.0%  )

Since an 8-way cache has fewer slots in which potentially aliasing functions can hide, you get the addressing equivalent of more hash collisions. On the machine with the higher (imaginary) associativity, you luck out with where things get placed in the object file: even though there is still no cache miss, you also don't have to do any work to resolve which cache line you actually need.

Edit: more on cache associativity: http://en.wikipedia.org/wiki/CPU_cache#Associativity
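
To make the aliasing picture concrete, here is a small standalone sketch (my own illustration, not part of the original answer) that computes which set of a 32k, 8-way, 64-byte-line cache a given code address maps to, using the same geometry as the cachegrind run above. With only 64 sets of 8 ways each, any code whose addresses are congruent modulo 4096 competes for the same 8 lines:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Cache geometry matching the cachegrind run above:
// 32768 bytes / (8 ways * 64-byte lines) = 64 sets.
constexpr uintptr_t CACHE_SIZE = 32768;
constexpr uintptr_t WAYS       = 8;
constexpr uintptr_t LINE       = 64;
constexpr uintptr_t SETS       = CACHE_SIZE / (WAYS * LINE);

// The set index comes from the address bits just above the 6-bit line offset.
static uintptr_t cache_set(uintptr_t addr) {
    return (addr / LINE) % SETS;
}

// Stand-ins for the a()/b() of the question, just to have two code addresses.
uint32_t __attribute__((noinline)) a() { asm(""); return 1; }
uint32_t __attribute__((noinline)) b() { asm(""); return 2; }

int main() {
    uintptr_t pa = reinterpret_cast<uintptr_t>(&a);
    uintptr_t pb = reinterpret_cast<uintptr_t>(&b);

    // Addresses congruent modulo SETS * LINE (4096 here) land in the same set
    // and must share its 8 ways with whatever else maps there (the timing
    // loop in test(), library code, and so on).
    std::printf("a() at %#zx -> set %zu\n", (size_t)pa, (size_t)cache_set(pa));
    std::printf("b() at %#zx -> set %zu\n", (size_t)pb, (size_t)cache_set(pb));
    return 0;
}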


Another edit: I've confirmed this with hardware event monitoring through the perf tool.

I modified the source to call only a() or b() depending on whether there was a command line argument present. The timings are the same as in the original test case.
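
The answer does not show the modified source, but the change presumably amounts to something like this sketch of main() (my reconstruction, not the answer's actual code), with the rest of the question's program left untouched:

int main(int argc, char **argv)
{
    // With any command line argument present, time only b(); otherwise time only a().
    if (argc > 1) {
        test("b", b);
    } else {
        test("a", a);
    }
    return 0;
}

This matches the runs below: /tmp/so with no argument reports a, and /tmp/so foobar reports b.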

sudo perf record -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses /tmp/so
a:  6.317755s wall, 6.300000s user + 0.000000s system = 6.300000s CPU (99.7%)
sudo perf report 

4K dTLB-loads
97 dTLB-load-misses
4K dTLB-stores
7 dTLB-store-misses
479 iTLB-loads
142 iTLB-load-misses               

whereas

sudo perf record -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses /tmp/so foobar
b:  4.854249s wall, 4.840000s user + 0.000000s system = 4.840000s CPU (99.7%)
sudo perf report 

3K dTLB-loads
87 dTLB-load-misses
3K dTLB-stores
19 dTLB-store-misses
259 iTLB-loads
93 iTLB-load-misses

This shows that b has less TLB activity, and so the cache doesn't have to be evicted. Given that the functionality is otherwise identical between the two, the difference can only be explained by aliasing.
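
As a rough way to relate the iTLB numbers to code placement, one could also check which 4 KB pages the hot functions land on. This is a hypothetical sketch of mine (assuming standard 4 KB pages, not something from the original answer); functions that sit on a different page from the timing loop need their own iTLB entries:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Stand-ins for the question's functions; in the real program one would take
// the addresses of the actual a(), b(), and test() instead.
uint32_t __attribute__((noinline)) a() { asm(""); return 1; }
uint32_t __attribute__((noinline)) b() { asm(""); return 2; }

int main() {
    const uintptr_t PAGE = 4096;  // assuming standard 4 KB pages
    uintptr_t pa = reinterpret_cast<uintptr_t>(&a);
    uintptr_t pb = reinterpret_cast<uintptr_t>(&b);

    // Functions on the same page share one iTLB entry; functions on different
    // pages each need their own, which shows up as additional iTLB loads.
    std::printf("a() on page %#zx\n", (size_t)(pa / PAGE));
    std::printf("b() on page %#zx\n", (size_t)(pb / PAGE));
    return 0;
}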

Other tips

You are calling a and b from test. Since the compiler has no reason to reorder your two functions, a is further away than b (in the original) from test. You are also using templates, so the actual generated code is quite a bit bigger than it looks in the C++ source.

It is therefore quite possible that the instruction memory for b gets into the instruction cache together with test, while a, being further away, does not get into the cache and therefore takes longer to fetch from lower-level caches or CPU main memory than b.

Because a has longer instruction fetch cycles than b, a runs slower than b even though the actual code is the same; it is just further away.
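
One quick way to sanity-check this layout theory (my suggestion, not part of the original answer) is to print where the linker actually placed each function, for example by adding a few lines at the top of main() in the question's program:

    // Print the addresses of the three functions to see which of a() and b()
    // ends up closer to test() in the text segment.
    std::cout << "test: " << reinterpret_cast<void*>(&test) << "\n"
              << "a:    " << reinterpret_cast<void*>(&a)    << "\n"
              << "b:    " << reinterpret_cast<void*>(&b)    << "\n";

If this explanation is right, the function that times slower should be the one further from test.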

Certain CPU architectures (such as the ARM Cortex-A series) support performance counters that count cache misses. Tools like perf can capture this data when set up to work with the appropriate performance counters.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow