Has the compiler optimized-out the virtual call?

Question 1

000000013F4241DA  mov     rax,qword ptr [rsp+0F8h] //load "this" into rax
000000013F4241E2  mov     rax,qword ptr [rax]      //load vtable pointer
000000013F4241E5  mov     rcx,qword ptr [rsp+0F8h] //load "this" into rcx
000000013F4241ED  call    qword ptr [rax+8]        //call second entry in vtable?

Clearly, call to virtual function is not optimized away. You are right, it is because of random factor.

Question 2

The compiler would not be able to inline the virtual call because, yes, it would be unable to know which object d1 or d2 would be used, therefore allowing for two possible inline results. Additionally, as a virtual call, there may be additional overhead for vtable lookups.

My recommendation, if you want to try to optimize yourself, would be to instead write something similar to

if(__rdtsc() & 1 != 0){
    for(unsigned long long i =0; i<9000000000; ++i){
        sum += d1[0].process2();
    }
}
else{
    for(unsigned long long i =0; i<9000000000; ++i){
        sum += d2[0].process2();
    }
}

though this still may be unable to optimize if process2 is a virtual call, and there is always a chance that inlining will not occur.

All in all, virtual calls always add overhead, and if clock cycles are important, it might do well to avoid. You may look into Static Polymorphism which loses some flexibility but can transfer costs from runtime to compile time.

Edit in response to user997112 below: Static Polymorphism doesn't work for exactly the situation described above, but could be used to simplify my example a bit, but putting the for loop in a function:

void iterate_a_bunch( Parent<Child> &f )
{
    for(unsigned long long i =0; i<9000000000; ++i){
      f.process2();
    }
}

This function would compile twice, once for Child1 and once for Child2, leading to larger code size, but potentially boosted runtimes.

Question 3

I did the experiment with g++ and with -O3 it looks optimized. However, I have to say that the optimization looks exactly the same as what dvntehn00bz offers in his answer.

400860:       48 8b 45 00             mov    0x0(%rbp),%rax
400864:       48 89 ef                mov    %rbp,%rdi
400867:       ff 50 10                callq  *0x10(%rax)
40086a:       48 83 eb 01             sub    $0x1,%rbx
40086e:       75 f0                   jne    400860 <main+0x40>

The whole code has two loops, completely automatically.

P.S. Of course, your classes are probably a lot more complex than my slim version so the optimization is certainly much easier for the compiler.