Question

I have a program that has at its heart a 2D array in the form of a

std::vector<std::vector<int>> grid;

And there's a simple double for loop going on that goes somewhat like this:

for(int i=1; i<N-1; ++i)
    for(int j=1; j<N-1; ++j)
        sum += grid[i][j-1] + grid[i][j+1] + grid[i-1][j] + grid[i+1][j] + grid[i][j]*some_float;

With g++ -O3 it runs fairly fast, but to optimize further I profiled with callgrind and saw an L1 data cache miss rate of about 37% (33% for LL), which is a lot, though not too surprising given the random-ish nature of the computation. So I applied profile-guided optimization:

g++ -fprofile-generate -O3 ...
./program
g++ -fprofile-use -O3 ...

and the program runs about 48% faster! But here is the puzzling part: the cache misses have even increased! The L1 data cache miss rate is now 40%, and LL is the same.

How can that be? There are no conditionals in the loop whose branch prediction could have been optimized, and the cache misses are even higher. Yet it is faster.

edit: Alright, here's the sscce: http://pastebin.com/fLgskdQG . Play around with the N for different runtime. Compiled via

g++ -O3 -std=c++11 sscce.cpp

on gcc 4.8.1 under linux.

Profile-guided optimization was done with the commands above. The callgrind numbers come from building with the g++ -g switch and running valgrind --tool=callgrind --simulate-cache=yes ./sscce


Solution

I noticed only one significant difference between the assembly generated with and without PGO: without PGO, the sum variable is spilled from a register to memory once per inner-loop iteration. Writing a variable to memory and loading it back can, in theory, slow things down significantly. Fortunately, modern processors mitigate this with store-to-load forwarding, so the slowdown is not that big. Still, Intel's optimization manual recommends against spilling floating-point variables to memory, especially when they are computed by long-latency operations like floating-point multiplication.

What is really puzzling here is why GCC needs PGO to avoid spilling the register to memory. There are enough unused floating-point registers, and even without PGO the compiler could get all the information necessary for this optimization from the single source file...

These unnecessary load/store operations explain not only why the PGO code is faster, but also why the percentage of cache misses increases. Without PGO the register is always spilled to the same memory location, so these extra accesses increase both the total number of memory accesses and the number of cache hits, while the number of cache misses stays the same. With PGO we have fewer memory accesses but the same number of cache misses, so their percentage rises.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow