Runtime optimization of static languages: JIT for C++?

https://stackoverflow.com/questions/780349

13-09-2019
|

Question

Is anyone using JIT tricks to improve the runtime performance of statically compiled languages such as C++? It seems like hotspot analysis and branch prediction based on observations made during runtime could improve the performance of any code, but maybe there's some fundamental strategic reason why making such observations and implementing changes during runtime are only possible in virtual machines. I distinctly recall overhearing C++ compiler writers mutter "you can do that for programs written in C++ too" while listening to dynamic language enthusiasts talk about collecting statistics and rearranging code, but my web searches for evidence to support this memory have come up dry.

Solution

Profile guided optimization is different than runtime optimization. The optimization is still done offline, based on profiling information, but once the binary is shipped there is no ongoing optimization, so if the usage patterns of the profile-guided optimization phase don't accurately reflect real-world usage then the results will be imperfect, and the program also won't adapt to different usage patterns.

You may be interesting in looking for information on HP's Dynamo, although that system focused on native binary -> native binary translation, although since C++ is almost exclusively compiled to native code I suppose that's exactly what you are looking for.

You may also want to take a look at LLVM, which is a compiler framework and intermediate representation that supports JIT compilation and runtime optimization, although I'm not sure if there are actually any LLVM-based runtimes that can compile C++ and execute + runtime optimize it yet.

OTHER TIPS

I did that kind of optimization quite a lot in the last years. It was for a graphic rendering API that I've implemented. Since the API defined several thousand different drawing modes as general purpose function was way to slow.

I ended up writing my own little Jit-compiler for a domain specific language (very close to asm, but with some high level control structures and local variables thrown in).

The performance improvement I got was between factor 10 and 60 (depended on the complexity of the compiled code), so the extra work paid off big time.

On the PC I would not start to write my own jit-compiler but use either LIBJIT or LLVM for the jit-compilation. It wasn't possible in my case due to the fact that I was working on a non mainstream embedded processor that is not supported by LIBJIT/LLVM, so I had to invent my own.

The answer is more likely: no one did more than PGO for C++ because the benefits are likely unnoticeable.

Let me elaborate: JIT engines/runtimes have both blesses and drawbacks from their developer's view: they have more information at runtime but much little time to analyze. Some optimizations are really expensive and you will unlikely see without a huge impact on start time are those one like: loop unrolling, auto-vectorization (which in most cases is also based on loop unrolling), instruction selection (to use SSE4.1 for CPU that use SSE4.1) combined with instruction scheduling and reordering (to use better super-scalar CPUs). This kind of optimizations combine great with C like code (that is accessible from C++).

The single full-blown compiler architecture to do advanced compilation (as far as I know) is the Java Hotspot compilation and architectures with similar principles using tiered compilation (Java Azul's systems, the popular to the day JaegerMonkey JS engine).

But one of the biggest optimization on runtime is the following:

Polymorphic inline caching (meaning that if you run the first loop with some types, the second time, the code of the loop will be specialized types that were from previous loop, and the JIT will put a guard and will put as default branch the inlined types, and based on it, from this specialized form using a SSA-form engine based will apply constant folding/propagation, inlining, dead-code-elimination optimizations, and depends of how "advanced" the JIT is, will do an improved or less improved CPU register assignment.) As you may notice, the JIT (hotspots) will improve mostly the branchy code, and with runtime information will get better than a C++ code, but a static compiler, having at it's side the time to do analysis, instruction reordering, for simple loops, will likely get a little better performance. Also, typically, the C++ code, areas that need to be fast tends to not be OOP, so the information of the JIT optimizations will not bring such an amazing improvement.

Another advantage of JITs is that JIT works cross assemblies, so it has more information if it wants to do inlining.

Let me elaborate: let's say that you have a base class A and you have just one implementation of it namely B in another package/assembly/gem/etc. and is loaded dynamically.

The JIT as it see that B is the only implementation of A, it can replace everywhere in it's internal representation the A calls with B codes, and the method calls will not do a dispatch (look on vtable) but will be direct calls. Those direct calls may be inlined also. For example this B have a method: getLength() which returns 2, all calls of getLength() may be reduced to constant 2 all over. At the end a C++ code will not be able to skip the virtual call of B from another dll.

Some implementations of C++ do not support to optimize over more .cpp files (even today there is the -lto flag in recent versions of GCC that makes this possible). But if you are a C++ developer, concerned about speed, you will likely put the all sensitive classes in the same static library or even in the same file, so the compiler can inline it nicely, making the extra information that JIT have it by design, to be provided by developer itself, so no performance loss.

visual studio has an option for doing runtime profiling that then can be used for optimization of code.

"Profile Guided Optimization"

Microsoft Visual Studio calls this "profile guided optimization"; you can learn more about it at MSDN. Basically, you run the program a bunch of times with a profiler attached to record its hotspots and other performance characteristics, and then you can feed the profiler's output into the compiler to get appropriate optimizations.

I believe LLVM attempts to do some of this. It attempts to optimize across the whole lifetime of the program (compile-time, link-time, and run-time).

Reasonable question - but with a doubtful premise.

As in Nils' answer, sometimes "optimization" means "low-level optimization", which is a nice subject in its own right.

However, it is based on the concept of a "hot-spot", which has nowhere near the relevance it is commonly given.

Definition: a hot-spot is a small region of code where a process's program counter spends a large percentage of its time.

If there is a hot-spot, such as a tight inner loop occupying a lot of time, it is worth trying to optimize at the low level, if it is in code that you control (i.e. not in a third-party library).

Now suppose that inner loop contains a call to a function, any function. Now the program counter is not likely to be found there, because it is more likely to be in the function. So while the code may be wasteful, it is no longer a hot-spot.

There are many common ways to make software slow, of which hot-spots are one. However, in my experience, that is the only one of which most programmers are aware, and the only one to which low-level optimization applies.

See this.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow