Re Q1: Cache utilization is actually only one part of the "problem" with virtual calls. The whole point of virtual functions, and of late binding in general, is that the call site can invoke any implementation without changes. This mandates some indirection:
- Indirection implies space and/or time overhead to resolve the indirection.
- No matter how you implement the indirection, an indirect call can only be as fast as a static call if the CPU has a good branch predictor and the call site is monomorphic (i.e., only one implementation is ever called). And I'm not even sure a perfectly predicted indirect branch is as fast as a direct branch on all hardware developers care about.
- Not knowing the called function at compile time also inhibits optimizations that depend on knowing the callee (inlining, but also loop-invariant code motion and possibly more).
Your approach doesn't change that, and hence leaves most of the performance problems in place: It still wastes some time and space (only on additional arguments and branches, rather than vtable lookups), it doesn't permit inlining or other optimizations, and it doesn't remove indirect calls.
Re Q2: It's kind of an interprocedural spin on devirtualization, which C++ compilers already do to some degree (locally, with limitations as described by @us2012 in the comments). There are a few "minor" problems with it, but it may be worth it if applied selectively. Otherwise, you generate a lot more code, pass a lot of additional arguments, do a lot of extra branches, and gain very little or even take a net loss.
I assume the main issue is that it doesn't solve most of the performance issues described above. Generating specialized functions (rather than one generic body) for subclasses, and other variations on the same theme, may help with that. But that generates additional code that has to justify itself with performance gains, and the general consensus is that such aggressive optimizations aren't worth it for most code even in performance-critical programs.
In particular, virtual call overhead only matters in benchmarks of that very feature, or if you've already optimized the ever-loving hell out of everything else and still make a huge number of tiny indirect calls (an example from game development: several virtual method calls per geometrical object for drawing or frustum culling). In most code, the virtual calls don't matter, or at least not enough to warrant further optimization attempts. In addition, this is only relevant for AOT compilers, as JIT compilers have other ways of dealing with these problems. Look up polymorphic inline caches, and note that tracing JIT compilers can trivially inline all calls, virtual or not.
In summary: vtables are already a fast and general way to implement virtual functions (when they can be used, which is the case here). It's unlikely you'll be able to improve much on them, let alone notice the improvement, except perhaps in some rare cases. If you want to try anyway, you could write an LLVM pass that does something like this (though you'd have to work at a lower level of abstraction).