How to correctly benchmark a [templated] C++ program

https://stackoverflow.com/questions/435627

22-07-2019
|

Question

< backgound>

I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.

< /background>

Now I need this feature to be as fast as possible (well without assembly code or GPU calculation, this still has to be C++ and more readable than less). Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.

However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.

Obviously I need to turn optimizations on because without this g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make a heavy use of the new feature in the benchmark because a delta of 1e-3 second can make the difference between a good and a bad design (this feature will be called million times in the real program).

The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it consider that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.

If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.

So how can I do a correct benchmark of a little feature extracted from a library ? Related question: is it a correct approach to do this kind of in vitro tests on a small unit or do I need the whole context ?

Thanks for advices !

There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler like volatile or extern.

I think I will try all of these. Thanks a lot for all your answers!

Solution

If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.

template<typename T> void sink(T const& t) {
   volatile T sinkhole = t;
}

No iostream overhead, just a copy that has to remain in the generated code. Now, if you're collecting results from a lot of operations, it's best not to discard them one by one. These copies can still add some overhead. Instead, somehow collect all results in a single non-volatile object (so all individual results are needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition will subsequently be assigned-to-volatile so each char in each sting must in fact be calculated, no shortcuts allowed.

OTHER TIPS

Unless you have a really aggressive compiler (can happen), I'd suggest calculating a checksum (simply add all the results together) and output the checksum.

Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.

Compilers are only allowed to eliminate code-branches that can not happen. As long as it cannot rule out that a branch should be executed, it will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not too smart about estimating which aspects of a program will not be run and don't try to, because that's a NP problem and hardly computable. They have some simple checks such as for if (0), but that's about it.

My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions.

But anyways, since this is about a test of speed, you can check that things get called for yourself - run it once without, then another time with a test of return values. Or a static variable being incremented. At the end of the test, print out the number generated. The results will be equal.

To answer your question about in-vitro testing: Yes, do that. If your app is so time-critical, do that. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, then that sounds like a problem of computational complexity, since the method in question must be called very, very often (for few runs, 1e-3 seconds is neglectible).

The problem domain you are modeling sounds VERY complex and the datasets are probably huge. Such things are always an interesting effort. Make sure that you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So, I'd say look at the whole context first. ;-)

Out of curiosity, what is the problem you are calculating?

You have a lot of control on the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.

From the man pages

       -O2 turns on all optimization flags specified by -O.  It also turns
       on the following optimization flags: -fthread-jumps -falign-func‐
       tions  -falign-jumps -falign-loops  -falign-labels -fcaller-saves
       -fcrossjumping -fcse-follow-jumps  -fcse-skip-blocks
       -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse
       -fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove -fre‐
       order-blocks  -freorder-functions -frerun-cse-after-loop
       -fsched-interblock  -fsched-spec -fschedule-insns  -fsched‐
       ule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-pre
       -ftree-vrp

You can tweak and use this command to help you narrow down which options to investigate.

       ...
       Alternatively you can discover which binary optimizations are
       enabled by -O3 by using:

               gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
               gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
               diff /tmp/O2-opts /tmp/O3-opts Φ grep enabled

Once you find the culpret optimization you shouldn't need the cout's.

If this is possible for you, you might try splitting your code into:

the library you want to test compiled with all optimizations turned on
a test program, dinamically linking the library, with optimizations turned off

Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test functio n with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).

You could create a dummy function in a separate cpp file that does nothing, but takes as argument whatever is the type of your calculation result. Then you can call that function with the results of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of invoking a function (which shouldn't skew your results unless you call it a lot!).

#include <iostream>

// Mark coords as extern.
// Compiler is now NOT allowed to optimise away coords
// This it can not remove the loop where you initialise it.
// This is because the code could be used by another compilation unit
extern double coords[500][3];
double coords[500][3];

int main()
{

//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
 {
   coords[i][0] = 3.23;
   coords[i][1] = 1.345;
   coords[i][2] = 123.998;
 }


std::cout << "hello world !"<< std::endl;
return 0;
}

edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmarks. Like,

StartBenchmarking(); // ie, read a performance counter
for (int i=0; i<500; ++i)
 {
   coords[i][0] = 3.23;
   coords[i][1] = 1.345;
   coords[i][2] = 123.998;
 }
StopBenchmarking(); // what comes after this won't go into the timer

// this is just to force the compiler to use coords
double foo;
for (int j = 0 ; j < 500 ; ++j )
{
  foo += coords[j][0] + coords[j][1] + coords[j][2]; 
}
cout << foo;

What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be eg memory-mapped I/O). So,

void test1( volatile double *coords )
{
  //perform a simple initialization of all coordinates:
  for (int i=0; i<1500; i+=3)
  {
    coords[i+0] = 3.23;
    coords[i+1] = 1.345;
    coords[i+2] = 123.998;
  }
}

For some reason I haven't figured out yet it doesn't always work in MSVC, but it often does -- look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in register and forces writes to occur in program order) so this is only trustworthy if you're using it for the final write-out of data.

In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.

The difference between test-tube profiling like this and running it in "the real world" means you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.

I don't know if GCC has a similar feature, but with VC++ you can use:

#pragma optimize

to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.

Just a small example of an unwanted optimization:

#include <vector>
#include <iostream>

using namespace std;

int main()
{
double coords[500][3];

//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
 {
   coords[i][0] = 3.23;
   coords[i][1] = 1.345;
   coords[i][2] = 123.998;
 }


cout << "hello world !"<< endl;
return 0;
}

If you comment the code from "double coords[500][3]" to the end of the for loop it will generate exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to show this behavior with a std::vector of a simple "Coordinates" structure.

However I think this example still shows that some optimizations can introduce errors in the benchmark and I wanted to avoid some surprises of this kind when introducing new code in a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.

The same should also apply with virtual functions (but I don't prove it here). Used in a context where a static link would do the job I'm pretty confident that decent compilers should eliminate the extra indirection call for the virtual function. I can try this call in a loop and conclude that calling a virtual function is not such a big deal. Then I'll call it hundred of thousand times in a context where the compiler cannot guess what will be the exact type of the pointer and have a 20% increase of running time...

at startup, read from a file. in your code, say if(input == "x") cout<< result_of_benchmark;

The compiler will not be able to eliminate the calculation, and if you ensure the input is not "x", you won't benchmark the iostream.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow