Difficulties to measure C/C++ performance

Question 1

With -O1, gcc-4.7.1 calls unpredictableIfs only once and resuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)

With higher optimisation level, the functions are inlined, and the compiler doesn't recognize that it's the same code anymore, so it is run each time a function call appears in the source.

Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.

With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):

.L12:
    movl    %eax, %ecx
    andl    $1073741826, %ecx
    cmpl    $1, %ecx
    adcl    $0, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne .L12

and for noIfs it's similar:

.L15:
    xorl    %ecx, %ecx
    testl   $1073741826, %eax
    sete    %cl
    addl    $1, %eax
    addl    %ecx, %edx
    cmpl    $1000000000, %eax
    jne .L15

where it was

.L7:
    testl   $1073741826, %edx
    sete    %cl
    movzbl  %cl, %ecx
    addl    %ecx, %eax
    addl    $1, %edx
    cmpl    $1000000000, %edx
    jne .L7

with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.

With -O3, the loop for unpredictableIfs becomes worse,

.L14:
    leal    1(%rdx), %ecx
    testl   $1073741826, %eax
    cmove   %ecx, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne     .L14

and for noIfs (including the setup-code here), it becomes better:

    pxor    %xmm2, %xmm2
    movq    %rax, 32(%rsp)
    movdqa  .LC3(%rip), %xmm6
    xorl    %eax, %eax
    movdqa  .LC2(%rip), %xmm1
    movdqa  %xmm2, %xmm3
    movdqa  .LC4(%rip), %xmm5
    movdqa  .LC5(%rip), %xmm4
    .p2align 4,,10
    .p2align 3
.L18:
    movdqa  %xmm1, %xmm0
    addl    $1, %eax
    paddd   %xmm6, %xmm1
    cmpl    $250000000, %eax
    pand    %xmm5, %xmm0
    pcmpeqd %xmm3, %xmm0
    pand    %xmm4, %xmm0
    paddd   %xmm0, %xmm2
    jne .L18

.LC2:
    .long   0
    .long   1
    .long   2
    .long   3
    .align 16
.LC3:
    .long   4
    .long   4
    .long   4
    .long   4
    .align 16
.LC4:
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .align 16
.LC5:
    .long   1
    .long   1
    .long   1
    .long   1

it computes four iterations at once, and accordingly, noIfs runs almost four times as fast then.

Question 2

Right, looking at the assembler code from gcc on 64-bit Linux, the first case, with -O1, the function UnpredictableIfs is indeed called only once, and the result reused.

With -O2 and -O3, the functions are inlined, and the time it takes should be identical. There is also no actual branches in either bit of code, but the translation for the two bits of code is somewhat different, I've cut out the lines that update "sum" [in %edx in both cases]

UnpredictableIfs:

movl    %eax, %ecx
andl    $1073741826, %ecx
cmpl    $1, %ecx
adcl    $0, %edx
addl    $1, %eax

NoIfs:

xorl    %ecx, %ecx
testl   $1073741826, %eax
sete    %cl
addl    $1, %eax
addl    %ecx, %edx

As you can see, it's not quite identical, but it does very similar things.

Question 3

Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock() in MSVC returns elapsed wall time. The standard says clock() should return an approximation of CPU time spent by the process, and other implementations do a better job of this.

Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.

Also note that clock() on many Windows PCs has a pretty lousy resolution, often like 11-19 ms. You've done enough iterations that that's only about 1%, so I don't think it's part of the discrepancy, but it's good to be aware of when trying to write a benchmark. I understand you're going for portability, but if you needed a better measurement on Windows, you can use QueryPerformanceCounter which will almost certainly give you much better resolution, though it's still just elapsed wall time.

UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.

Observations:

In all cases the unpredictableIfs() code was inlined.
Removing the noIfs() code had no impact (so the long time wasn't a side effect of that code).
Setting thread affinity to a single processor had no effect.
The 5000 ms times were invariably the later instances. I noted that the later instances had an extra instruction before the beginning of the loop: lea ecx,[ecx]. I don't see why that should make a 5x difference. Other than that the early and later instances were identical code.
Removing the volatile from the start and stop variables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code looks exactly the same in all cases now, not leas.)
Removing the volatile from the sum variable (but keeping it for the clock timers), the long runs can happen in any position.
If you remove all of the volatile qualifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, but edi was chosen for sum instead of ecx.)

I'm not sure what to conclude from all this, except that volatile has unpredictable performance consequences with MSVC, so you should apply it only when necessary.

UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.

With volatile:

Puzzling measurements:
Unpredictable ifs took 643 msec; answer was 500000000
Unpredictable ifs took 1248 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 4611 msec; answer was 500000000
Unpredictable ifs took 4706 msec; answer was 500000000
Unpredictable ifs took 4516 msec; answer was 500000000
Unpredictable ifs took 4382 msec; answer was 500000000

The disassembly for each instance looks like this:

    start = clock();
010D1015  mov         esi,dword ptr [__imp__clock (10D20A0h)]  
010D101B  add         esp,4  
010D101E  call        esi  
010D1020  mov         dword ptr [start],eax  
    sum = unpredictableIfs();
010D1023  xor         ecx,ecx  
010D1025  xor         eax,eax  
010D1027  test        eax,40000002h  
010D102C  jne         main+2Fh (10D102Fh)  
010D102E  inc         ecx  
010D102F  inc         eax  
010D1030  cmp         eax,3B9ACA00h  
010D1035  jl          main+27h (10D1027h)  
010D1037  mov         dword ptr [sum],ecx  
    stop = clock();
010D103A  call        esi  
010D103C  mov         dword ptr [stop],eax

Without volatile:

Puzzling measurements:
Unpredictable ifs took 644 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000

    start = clock();
00321014  mov         esi,dword ptr [__imp__clock (3220A0h)]  
0032101A  add         esp,4  
0032101D  call        esi  
0032101F  mov         dword ptr [start],eax  
    sum = unpredictableIfs();
00321022  xor         ebx,ebx  
00321024  xor         eax,eax  
00321026  test        eax,40000002h  
0032102B  jne         main+2Eh (32102Eh)  
0032102D  inc         ebx  
0032102E  inc         eax  
0032102F  cmp         eax,3B9ACA00h  
00321034  jl          main+26h (321026h)  
    stop = clock();
00321036  call        esi
// The only optimization I see is here, where eax isn't explicitly stored
// in stop but is instead immediately used to compute the value for the
// printf that follows.

Other than register selection, I don't see a significant difference.