Question

EDIT: At the request of n.m., I have included the complete code I was using despite its verbosity.

I've written a short example program to study the overhead of virtual calls.

timer.h:

#pragma once

#include <chrono>
#include <cstdint>

class Timer {
    typedef std::chrono::high_resolution_clock clock;
    typedef clock::time_point time_point;
    typedef clock::duration duration;

    time_point begin;
    time_point end;
public:
    void start()
    {
        begin = clock::now();
    }

    void stop()
    {
        end = clock::now();
    }

    double as_seconds()
    {
        duration dur = end - begin;

        return (double) dur.count() * (double) duration::period::num / (double) duration::period::den;
    }
};
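
For reference, the ratio arithmetic in as_seconds could also be left to <chrono> itself; this is just an equivalent formulation of the same conversion, not a change to what is measured:

double as_seconds()
{
    // duration<double> defaults to a period of one second, and the
    // conversion from the clock's integral duration is implicit
    // because a floating-point representation cannot truncate.
    return std::chrono::duration<double>(end - begin).count();
}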

virtual.h:

#pragma once

static const int NUM_DERIVED = 10;

struct Base {
    int val;

#ifdef NO_VIRTUAL
  #define virtual
#endif
    virtual void f();
    virtual ~Base();
#undef virtual
};

Base *create_base(unsigned max_derived);

virtual.cpp:

#include <algorithm>
#include <iostream>
#include <memory>
#include <typeinfo>
#include <vector>
#include "virtual.h"
#include "timer.h"

template <typename Container>
void measure_call(const Container &c)
{
    Timer t;

    t.start();
    for (auto &ptr : c) {
        ptr->f();
    }
    t.stop();

    std::cout << "Virtual calls: " << c.size() << '\n';
    std::cout << "Elapsed time (s): " << t.as_seconds() << '\n';
}

int main()
{
    typedef std::unique_ptr<Base> base_ptr;
    typedef std::vector<base_ptr> base_vector;

    const int NUM_VEC = 10000000;
    const int NUM_DERIVED_USED = NUM_DERIVED;

    base_vector vec1;
    base_vector vec2;

    Timer t;

    vec1.reserve(NUM_VEC);
    vec2.reserve(NUM_VEC);
    for (int i = 0; i < NUM_VEC; ++i) {
        Base *b1 = create_base(1);
        Base *b2 = create_base(NUM_DERIVED_USED);
        vec1.emplace_back(b1);
        vec2.emplace_back(b2);
    }

    std::cout << "Measuring random vector of " << 1 << " derived classes\n";
    measure_call(vec1);
    std::cout << '\n';

    std::cout << "Measuring random vector of " << NUM_DERIVED_USED << " derived classes\n";
    measure_call(vec2);
    std::cout << '\n';

    std::cout << "Sorted vector of " << NUM_DERIVED << " derived clases\n";
    std::sort(
        vec2.begin(), vec2.end(),
        [](const base_ptr &a, const base_ptr &b) -> bool
        {
            return typeid(*a).hash_code() < typeid(*b).hash_code();
        }
    );
    for (int i = 0; i < 20; ++i) {
        int idx = vec2.size() / 20 * i;
        std::cout << "vector[" << idx << "]: " << typeid(*vec2[idx]).name() << '\n';
    }
    std::cout << '\n';

    std::cout << "Measuring sorted vector of " << NUM_DERIVED_USED << " derived classes\n";
    measure_call(vec2);

    return 0;
}

virtual_impl.cpp:

#include <cstdlib>
#include "virtual.h"

template <int N>
struct Derived : Base {
    void f() { Base::val = N; }
    ~Derived() {}
};

void Base::f() {}

Base::~Base() {}

Base *create_base(unsigned max_derived)
{
    unsigned n = max_derived > NUM_DERIVED ? NUM_DERIVED : max_derived;

    switch (rand() % n) {
    case 9:
        return new Derived<9>;
    case 8:
        return new Derived<8>;
    case 7:
        return new Derived<7>;
    case 6:
        return new Derived<6>;
    case 5:
        return new Derived<5>;
    case 4:
        return new Derived<4>;
    case 3:
        return new Derived<3>;
    case 2:
        return new Derived<2>;
    case 1:
        return new Derived<1>;
    case 0:
        return new Derived<0>;
    default:
        return nullptr;
    }
}

I compile virtual_impl.cpp into a shared library to prevent the compiler from messing around and devirtualizing or inlining things:

$ g++ -shared --std=c++11 -O3 virtual_impl.cpp -o libvirtual_impl.so
$ g++ --std=c++11 -O3 virtual.cpp -L. -lvirtual_impl -o virtual
$ ./virtual

The result I get is:

Measuring random vector of 1 derived classes
Virtual calls: 10000000
Elapsed time (s): 0.039333

Measuring random vector of 10 derived classes
Virtual calls: 10000000
Elapsed time (s): 0.089604

Sorted vector of 10 derived classes
vector[0]: 7DerivedILi8EE
vector[500000]: 7DerivedILi8EE
vector[1000000]: 7DerivedILi8EE
vector[1500000]: 7DerivedILi7EE
vector[2000000]: 7DerivedILi7EE
vector[2500000]: 7DerivedILi2EE
vector[3000000]: 7DerivedILi2EE
vector[3500000]: 7DerivedILi9EE
vector[4000000]: 7DerivedILi9EE
vector[4500000]: 7DerivedILi0EE
vector[5000000]: 7DerivedILi1EE
vector[5500000]: 7DerivedILi1EE
vector[6000000]: 7DerivedILi1EE
vector[6500000]: 7DerivedILi6EE
vector[7000000]: 7DerivedILi6EE
vector[7500000]: 7DerivedILi4EE
vector[8000000]: 7DerivedILi4EE
vector[8500000]: 7DerivedILi3EE
vector[9000000]: 7DerivedILi5EE
vector[9500000]: 7DerivedILi5EE

Measuring sorted vector of 10 derived classes
Virtual calls: 10000000
Elapsed time (s): 0.115058

As you can see, with 10 derived classes the cost is more than twice that with only one derived class (and nearly 3x once the vector is sorted by type). Branch prediction seems like a likely culprit, but somehow the list sorted by type performs even worse than the random list!
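
One hypothetical variant that could help separate dispatch cost from data-access cost (not something I measured; DerivedNoStore is just an illustrative name) is a set of derived classes whose override performs no store, so each call still goes through the vtable but touches no object data beyond the vtable pointer:

// Possible addition to virtual_impl.cpp: same indirect call through
// the vtable, but f() does not write to the object.
template <int N>
struct DerivedNoStore : Base {
    void f() {}          // no store to Base::val
    ~DerivedNoStore() {}
};

Comparing it against the original Derived would indicate how much of the cost comes from the store into the object rather than from the indirect call itself.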

EDIT2: There doesn't seem to be any black magic going on in the code generation. Here is the disassembly for measure_call:

(gdb) disassemble
Dump of assembler code for function _Z12measure_callISt6vectorISt10unique_ptrI4BaseSt14default_deleteIS2_EESaIS5_EEEvRKT_:
0x0000000100002170: push   r14
0x0000000100002172: push   r13
0x0000000100002174: push   r12
0x0000000100002176: mov    r12,rdi
0x0000000100002179: push   rbp
0x000000010000217a: push   rbx
0x000000010000217b: sub    rsp,0x10
0x000000010000217f: call   0x10000283e <dyld_stub__ZNSt6chrono3_V212system_clock3nowEv>
0x0000000100002184: mov    rbp,QWORD PTR [r12+0x8]
0x0000000100002189: mov    rbx,QWORD PTR [r12]
0x000000010000218d: mov    r13,rax
0x0000000100002190: cmp    rbx,rbp
0x0000000100002193: je     0x1000021b1 <_Z12measure_callISt6vectorISt10unique_ptrI4BaseSt14default_deleteIS2_EESaIS5_EEEvRKT_+65>
0x0000000100002195: nop    DWORD PTR [rax+rax+0x0]
0x000000010000219a: nop    WORD PTR [rax+rax+0x0]
0x00000001000021a0: mov    rdi,QWORD PTR [rbx]
0x00000001000021a3: add    rbx,0x8
0x00000001000021a7: mov    rcx,QWORD PTR [rdi]
0x00000001000021aa: call   QWORD PTR [rcx]
0x00000001000021ac: cmp    rbp,rbx
0x00000001000021af: jne    0x1000021a0 <_Z12measure_callISt6vectorISt10unique_ptrI4BaseSt14default_deleteIS2_EESaIS5_EEEvRKT_+48>
0x00000001000021b1: call   0x10000283e <dyld_stub__ZNSt6chrono3_V212system_clock3nowEv>
[iostream stuff follows]

Solution

The issue is this:

void f() { Base::val = N; }

and this:

for (auto &ptr : c) {
    ptr->f();
}

Sorting by type randomizes which memory location you are reading from (to get the vtable address) and assigning to (Base::val), and that loss of locality will dwarf the branch-prediction effect for 10 vtables.
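
One way to check this (a sketch only, reusing the base_ptr typedef and vec2 from the question's main) is to re-sort by object address instead of by type, which roughly restores allocation-order traversal while keeping the mix of vtable targets:

// Sketch: sort by address rather than by typeid, so the loop walks the
// heap in (roughly) allocation order again while the sequence of vtable
// targets stays mixed. std::less gives a total order over unrelated
// pointers (needs <functional>).
std::sort(
    vec2.begin(), vec2.end(),
    [](const base_ptr &a, const base_ptr &b) -> bool
    {
        return std::less<Base*>()(a.get(), b.get());
    }
);
measure_call(vec2);

If the time then falls back to roughly the unsorted 10-class figure, the extra cost of the type-sorted run was the loss of locality rather than indirect-branch misprediction.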

Other tips

You don't show the implementation of create_base, so I'm only guessing, but I think you're seeing the effect of branch prediction.

With create_base(1), all the pointers point at objects of the same derived class, so all the calls go to the same derived virtual function. After the first call, the branch predictor correctly predicts the target address of the indirect branch, so there is no branch delay.

With create_base(10), you're creating pointers to 10 different derived classes, each with its own derived virtual function. So every call is to a different address, causing the branch target predictor to miss much (90%?) of the time.

If you try create_base(2), I think you'll see that it's midway between the 1 and 10 cases (mispredicting about half the time), and that larger values quickly approach the 10 case.
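
To run that intermediate experiment, a small sketch (assuming the same typedefs, NUM_VEC, and measure_call as in the question's main) could sweep max_derived and time each case:

// Sketch: measure the random-order call cost for several numbers of
// distinct derived classes.
const unsigned counts[] = {1, 2, 5, 10};
for (unsigned k : counts) {
    base_vector vec;
    vec.reserve(NUM_VEC);
    for (int i = 0; i < NUM_VEC; ++i) {
        vec.emplace_back(create_base(k));
    }
    std::cout << "Measuring random vector of " << k << " derived classes\n";
    measure_call(vec);
    std::cout << '\n';
}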

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow