Can vtable overhead be avoided using a static_cast?

Question 1

Will this solution avoid the vtable overhead or will it still be affected?

It will still use dynamic dispatch (whether that causes any noticeable overhead is a completely different question). You can disable dynamic dispatch by qualifying the function call as in:

static_cast<T&>(obj).T::fn();

That said, I would not even try to do so. Leave dynamic dispatch in place, measure the performance of the application, and profile it. Profile again to make sure that you understand what the profiler is telling you. Only then consider making a single change, and profile once more to verify whether your assumptions were correct.

Question 2

This isn't really an answer to your actual question, but I was curious as to "what really is the overhead of calling a virtual function vs calling a regular class function". To make it "fair", I created a classes.cpp that implements a very simple function, but it's in a separate file that is compiled separately from "main", so the calls cannot simply be inlined.

classes.h:

#ifndef CLASSES_H
#define CLASSES_H

class base
{
public:
    // Public so the function can also be called through a base pointer.
    virtual int vfunc(int x) = 0;
};

class vclass : public base
{
public:
    int vfunc(int x);
};


class nvclass
{
public:
    int nvfunc(int x);
};


nvclass *nvfactory();
vclass* vfactory();


#endif

classes.cpp:

#include "classes.h"

int vclass:: vfunc(int x)
{
    return x+1;
}


int nvclass::nvfunc(int x)
{
    return x+1;
}

nvclass *nvfactory()
{
    return new nvclass;
}

vclass* vfactory()
{
    return new vclass;
}

This is called from:

#include <cstdio>
#include <cstdlib>
#include "classes.h"

#if 0
#define ASSERT(x) do { if(!(x)) { assert_fail( __FILE__, __LINE__, #x); } } while(0)
static void assert_fail(const char* file, int line, const char *cond)
{
    fprintf(stderr, "ASSERT failed at %s:%d condition: %s \n",  file, line, cond); 
    exit(1);
}
#else
#define ASSERT(x) (void)(x)
#endif

#define SIZE 10000000

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}


void print_avg(const char *str, const int *diff, int size)
{
    int i;
    long sum = 0;
    for(i = 0; i < size; i++)
    {
        int t = diff[i];
        sum += t;
    }

    printf("%s average =%f clocks\n", str, (double)sum / size);
}


int diff[SIZE]; 

int main()
{
    unsigned long long a, b;
    int i;
    int sum = 0;
    int x;

    vclass *v = vfactory();
    nvclass *nv = nvfactory();


    for(i = 0; i < SIZE; i++)
    {
        a = rdtsc();

        x = 16;
        sum+=x;
        b = rdtsc();

        diff[i] = (int)(b - a);
    }

    print_avg("Emtpy", diff, SIZE);


    for(i = 0; i < SIZE; i++)
    {
        a = rdtsc();

        x = 0;
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        x = v->vfunc(x);
        ASSERT(x == 16); /* 16 calls, each adding 1 */
        sum+=x;
        b = rdtsc();

        diff[i] = (int)(b - a);
    }

    print_avg("Virtual", diff, SIZE);

    for(i = 0; i < SIZE; i++)
    {
        a = rdtsc();
        x = 0;
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        x = nv->nvfunc(x);
        ASSERT(x == 16); /* 16 calls, each adding 1 */
        sum+=x;
        b = rdtsc();
        diff[i] = (int)(b - a);
    }
    print_avg("no virtual", diff, SIZE);

    printf("sum=%d\n", sum);

    delete v;
    delete nv;

    return 0;
}

The real difference in the generated code is the call instruction. Virtual call:

40066b: ff 10                   callq  *(%rax)

non virtual call:

4006d3: e8 78 01 00 00          callq  400850 <_ZN7nvclass6nvfuncEi>

And the results:

Emtpy average =78.686081 clocks
Virtual average =144.732567 clocks
no virtual average =122.781466 clocks
sum=480000000

Remember that each timed iteration contains 16 calls. The non-virtual version costs about 44 clocks per iteration more than the empty loop, which is just under 3 clocks per call [including adding up the results and other processing required], and the virtual call adds another 22 clocks per iteration, so around 1.4 clocks per call.

I doubt you will notice, assuming you do something a bit more meaningful than return x + 1 in your function.

Question 3

In typical implementations the vtable belongs to the class, and each object of a class with virtual members stores a hidden pointer to it. Virtual members are called through that vtable. The cast affects neither whether the vtable exists nor how members are accessed.

Question 4

If you have a polymorphic array, where the elements are polymorphic but all elements have the same type, you can also externalize the vtable. This allows you to look up the function once and then call it directly on each element. C++ doesn't help you in that case, though; you will have to do it manually.

This is also useful if you are micro-optimizing things. I believe that boost::function uses a similar technique. It only needs two functions (call and release reference) in the vtable, but the compiler-generated one would also contain RTTI and some other stuff, which can be avoided by hand-coding a vtable that only has those two function pointers.