Pergunta

I have this C++ which creates two derived objects and then invokes a virtual function call many times:

Parent* d;
Child1 d1[1];
Child2 d2[1];

if(__rdtsc() & 1 != 0){
    d = d1;
}
else{
    d = d2;
}

for(unsigned long long i =0; i<9000000000; ++i){
    sum += d->process2();
}

and it produces this assembler:

        for(unsigned long long i =0; i<9000000000; ++i){
000000013F4241A5  mov         qword ptr [rsp+100h],0  
000000013F4241B1  jmp         main+2B6h (013F4241C6h)  
000000013F4241B3  mov         rax,qword ptr [rsp+100h]  
000000013F4241BB  inc         rax  
000000013F4241BE  mov         qword ptr [rsp+100h],rax  
000000013F4241C6  mov         rax,218711A00h  
000000013F4241D0  cmp         qword ptr [rsp+100h],rax  
000000013F4241D8  jae         main+306h (013F424216h)  
            sum += d->process2();
000000013F4241DA  mov         rax,qword ptr [rsp+0F8h]  
000000013F4241E2  mov         rax,qword ptr [rax]  
000000013F4241E5  mov         rcx,qword ptr [rsp+0F8h]  
000000013F4241ED  call        qword ptr [rax+8]  
000000013F4241F0  mov         qword ptr [rsp+1D8h],rax  
000000013F4241F8  mov         rax,qword ptr [rsp+1D8h]  
000000013F424200  mov         rcx,qword ptr [sum (013F4385D8h)]  
000000013F424207  add         rcx,rax  
000000013F42420A  mov         rax,rcx  
000000013F42420D  mov         qword ptr [sum (013F4385D8h)],rax  
        }

Based on the assembler could somebody please confirm that the compiler cannot optimize the virtual call in the loop (even though every iteration calls the same derived object) because the compiler cannot possibly know whether d1 or d2 was selected, due to the call to __rdtsc() only being resolvable at run-time?

(If anyone could give me some advice how to read the assembler for the d->process2() call it would be most appreciated)

Foi útil?

Solução

000000013F4241DA  mov     rax,qword ptr [rsp+0F8h] //load "this" into rax
000000013F4241E2  mov     rax,qword ptr [rax]      //load vtable pointer
000000013F4241E5  mov     rcx,qword ptr [rsp+0F8h] //load "this" into rcx
000000013F4241ED  call    qword ptr [rax+8]        //call second entry in vtable?

Clearly, call to virtual function is not optimized away. You are right, it is because of random factor.

Outras dicas

The compiler would not be able to inline the virtual call because, yes, it would be unable to know which object d1 or d2 would be used, therefore allowing for two possible inline results. Additionally, as a virtual call, there may be additional overhead for vtable lookups.

My recommendation, if you want to try to optimize yourself, would be to instead write something similar to

if(__rdtsc() & 1 != 0){
    for(unsigned long long i =0; i<9000000000; ++i){
        sum += d1[0].process2();
    }
}
else{
    for(unsigned long long i =0; i<9000000000; ++i){
        sum += d2[0].process2();
    }
}

though this still may be unable to optimize if process2 is a virtual call, and there is always a chance that inlining will not occur.

All in all, virtual calls always add overhead, and if clock cycles are important, it might do well to avoid. You may look into Static Polymorphism which loses some flexibility but can transfer costs from runtime to compile time.

Edit in response to user997112 below: Static Polymorphism doesn't work for exactly the situation described above, but could be used to simplify my example a bit, but putting the for loop in a function:

void iterate_a_bunch( Parent<Child> &f )
{
    for(unsigned long long i =0; i<9000000000; ++i){
      f.process2();
    }
}

This function would compile twice, once for Child1 and once for Child2, leading to larger code size, but potentially boosted runtimes.

I did the experiment with g++ and with -O3 it looks optimized. However, I have to say that the optimization looks exactly the same as what dvntehn00bz offers in his answer.

400860:       48 8b 45 00             mov    0x0(%rbp),%rax
400864:       48 89 ef                mov    %rbp,%rdi
400867:       ff 50 10                callq  *0x10(%rax)
40086a:       48 83 eb 01             sub    $0x1,%rbx
40086e:       75 f0                   jne    400860 <main+0x40>

The whole code has two loops, completely automatically.

P.S. Of course, your classes are probably a lot more complex than my slim version so the optimization is certainly much easier for the compiler.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top