Question

I'm benchmarking the overhead of GCC Profile-Guided Optimization on the SPEC benchmarks, and I'm seeing some weird results: two of my benchmarks actually run faster when instrumented.

The normal executable is compiled with: -g -O2 -march=native

The instrumented executable is compiled with: -g -O2 -march=native -fprofile-generate -fno-vpt

I'm using GCC 4.7 (the Google branch, to be precise). The computer on which the benchmarks are running has an Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz.

bwaves is a Fortran benchmark and libquantum is a C benchmark.

Here are the results:

bwaves-normal:       712.14
bwaves-instrumented: 697.22
  => ~2% faster

libquantum-normal:       463.88
libquantum-instrumented: 449.05
  => ~3.2% faster

I ran the benchmarks several times, thinking it could be a problem with my machine, but the results were confirmed each time.

I would understand a very small overhead on some programs, but I don't see any reason for an improvement.

So my question is: how can the GCC-instrumented executable be faster than the normal, optimized one?

Thanks


Solution

I can think of two possibilities, both relating to cache.

One is that the counter increments "warm" some important cache lines. The second is that the structures required by instrumentation cause some heavily used arrays or variables to fall into different cache lines.

Another point is that profiling (incrementing a counter) doesn't have to happen on every iteration of a loop: if there is no 'break' or 'return' inside the loop, the compiler is allowed to hoist the increment out of the loop.

OTHER TIPS

Looking at the GCC documentation, it appears that -fprofile-generate activates some specific code transformations to make profiling easier and cheaper, so the instrumented code isn't really the original code plus instrumentation. Those changes could make the code faster, and adding code also changes the caching behaviour. It's hard to know without seeing the offending code. From my (long-ago) fooling around with LCC, when profiling is done intelligently it involves surprisingly few code changes.

Just out of curiosity: how does the code compiled with the profile taken into consideration fare compared to the above?
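For reference, the full feedback cycle that question implies looks roughly like this with GCC (file and input names here are illustrative, not from the question; -fno-vpt matches the asker's instrumentation flags):

```shell
# 1. Build the instrumented binary (as in the question)
gcc -g -O2 -march=native -fprofile-generate -fno-vpt -o bench-inst bench.c

# 2. Run it on training input; this writes .gcda profile files
./bench-inst < train.in

# 3. Rebuild using the collected profile
gcc -g -O2 -march=native -fprofile-use -o bench-pgo bench.c

# 4. Time bench-pgo against the normal -O2 build
```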

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow