Question

I am interested in finding the number of memory accesses of a particular function in my program, which is written in C++ and runs on Linux. To count memory accesses I am using Valgrind's Cachegrind tool. I use the following command to get the memory accesses:

cg_annotate --show=Dr,Dw cachegrind.out.25329  |tee log.txt
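For context, I generate the profile file beforehand in the usual way (where ./myprogram stands in for my actual binary):

valgrind --tool=cachegrind ./myprogram

This writes the cachegrind.out.<pid> file that cg_annotate then reads.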

The number of memory accesses for that function comes out like this:

  379,010,475   697,368,671  ???:CheckInput(std::string)

Now, basically I have three functions and I want to compare them on the basis of the number of memory accesses. I want to make clear: is this the right way to do the comparison? Do I need to take the average number of memory accesses, or is one reading of the total memory accesses per function enough? Secondly, can I conclude that the one with the fewest memory accesses (memory reads + memory writes) is the fastest function?


Solution

Looking at cachegrind output is not a good way to determine the performance of functions in isolation. Such tests are a poor indication of how a function is going to perform in real usage, where things like branch prediction and cache hit rates come into play.

Cachegrind can often tell you WHY a given implementation is slow in practice, but you need to run it on a representative execution of your whole binary with representative data.

A couple of other things to note:

Raw Dr/Dw counts are not a good proxy for performance, since they tell you nothing about your cache hit rate. All else being equal, a function that reads 1000 values from L1 cache is going to be faster than a function that reads a single value in a way that triggers a page fault and forces the page to be loaded back in from disk.
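To make that concrete, here's a sketch (the names and the nested-vector layout are mine, purely for illustration). Both functions touch every element exactly once, so cachegrind reports the same Dr count for each, yet the row-major walk is typically far faster because it reads consecutive addresses:

#include <cstddef>
#include <vector>

// Both walks read every element exactly once (identical Dr counts),
// but their cache behaviour is completely different for large matrices.
// Both assume a non-empty, rectangular matrix.
long long sum_row_major(const std::vector<std::vector<int>>& m) {
    long long sum = 0;
    for (const auto& row : m)
        for (int x : row)
            sum += x;              // consecutive addresses: cache-friendly
    return sum;
}

long long sum_col_major(const std::vector<std::vector<int>>& m) {
    long long sum = 0;
    for (std::size_t col = 0; col < m[0].size(); ++col)
        for (std::size_t row = 0; row < m.size(); ++row)
            sum += m[row][col];    // jumps between rows: frequent cache misses
    return sum;
}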

You're not going to see anything about failed branch predictions. All else being equal, a function that performs poorly on branch prediction is going to be much slower than an equivalent function structured to perform well on branch prediction.
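A classic way to see this (my sketch, not code from the question): run the loop below once on sorted data and once on shuffled data. The instruction and memory-access counts are identical, but the sorted run is typically several times faster because the branch is predicted correctly almost every time:

#include <vector>

// Same number of memory accesses either way, but when 'v' is sorted the
// branch is almost always predicted correctly; on shuffled data it is
// mispredicted roughly half the time and the loop runs much slower.
long long count_big(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v)
        if (x >= 128)              // the hard-to-predict branch
            sum += x;
    return sum;
}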

edit ----------------------------------------------

Since you didn't give any details, I don't know what you're doing. But let's say you wrote tests like these:

for (int i = 0; i < 100000; ++i) {
   func1("test string");   // same constant input on every iteration
}
for (int i = 0; i < 100000; ++i) {
   func2("test string");   // ditto for the second candidate
}

This is not representative of how an actual program would use these functions (hence what I said about using representative data), so the test is pretty worthless; it's a "microbenchmark". Everything fits in the cache after the first pass through the function, and branch prediction will be far better than in real-world usage since you're always passing the same input.

To write proper performance tests, ask yourself "how would I use this function in my application?", and write your tests to mimic that. Better yet, actually profile the function in your application. Don't have an application? Then you're optimizing way too soon (unless your interest is purely academic).
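As a rough sketch of what that might look like (func1 and the sample inputs here are placeholders; substitute inputs that resemble your application's real traffic):

#include <chrono>
#include <iostream>
#include <string>
#include <vector>

// Placeholder for the real function under test (e.g. your CheckInput).
bool func1(const std::string& s) { return !s.empty(); }

int main() {
    // A mix of inputs approximating real usage, instead of one constant.
    std::vector<std::string> inputs = {
        "short", "a much longer, more representative string",
        "", "edge case with trailing newline\n",
    };

    auto start = std::chrono::steady_clock::now();
    long long hits = 0;
    for (int i = 0; i < 100000; ++i)
        hits += func1(inputs[i % inputs.size()]);   // vary the input each call
    auto elapsed = std::chrono::steady_clock::now() - start;

    std::cout << hits << " hits in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
              << " ms\n";
}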

For reasons I pointed out in my original answer, a raw count of memory accesses is not going to tell you the performance of a function, partly because not all memory accesses are created equal. Depending on WHERE the memory is being read from, access times differ by orders of magnitude (see: Approximate cost to access various caches and main memory?). Beyond that, there is much more going on than just memory accesses: you could write a function that performs billions of operations entirely on what's stored in registers.
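For example (a contrived sketch of mine): compiled with optimizations, the loop below does roughly two billion arithmetic operations while the working value stays in a register, so its data-access counts say almost nothing about its cost:

// With optimizations on, 'x' lives in a register for the whole loop, so
// cachegrind sees almost no data reads or writes here despite ~2 billion
// multiply/add operations (the constants are from Knuth's 64-bit LCG).
unsigned long long spin() {
    unsigned long long x = 1;
    for (long i = 0; i < 1000000000L; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;   // returning x keeps the compiler from deleting the loop
}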

You seem very focused on memory accesses... Have you tried reading up on profiling in general?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow