Question

Recently, a coworker pointed out to me that compiling everything into a single file created much more efficient code than compiling separate object files - even with link time optimization turned on. In addition, the total compile time for the project went down significantly. Given that one of the primary reasons for using C++ is code efficiency, this was surprising to me.

Clearly, when the archiver/linker makes a library out of object files, or links them into an executable, even simple optimizations are penalized. In the example below, trivial inlining costs 1.8% in performance when done by the linker instead of the compiler. It seems like compiler technology should be advanced enough to handle fairly common situations like this, but it isn't happening.

Here is a simple example using Visual Studio 2008:

#include <cstdlib>
#include <iostream>
#include <boost/timer.hpp>

using namespace std;

int foo(int x);
int foo2(int x) { return x++; }

int main(int argc, char** argv)
{
  boost::timer t;

  t.restart();
  for (int i=0; i<atoi(argv[1]); i++)
    foo (i);
  cout << "time : " << t.elapsed() << endl;

  t.restart();
  for (int i=0; i<atoi(argv[1]); i++)
    foo2 (i);
  cout << "time : " << t.elapsed() << endl;
}

foo.cpp

int foo (int x) { return x++; }

Results of the run: a 1.8% performance hit when using the linked foo instead of the inline foo2.

$ ./release/testlink.exe  100000000
time : 13.375
time : 13.14

And yes, the linker optimization flags (/LTCG) are on.
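
For reproducibility, here is a rough sketch of how the two configurations might be built from the VS2008 command line (the file name main.cpp, the output names, and the Boost include path are assumptions; adjust to your setup):

```shell
REM Separate compilation with link-time code generation:
REM /GL emits intermediate code in the object files so the linker
REM can optimize across modules, and /LTCG tells the linker to do so.
cl /O2 /GL /EHsc /I C:\boost main.cpp foo.cpp /link /LTCG /OUT:testlink.exe

REM Whole-program alternative: concatenate into a single translation unit,
REM so the compiler itself sees the definition of foo and can inline it.
type foo.cpp main.cpp > combined.cpp
cl /O2 /EHsc /I C:\boost combined.cpp /link /OUT:testinline.exe
```

The point of the comparison is that in the second build the compiler front end sees foo's body directly, while in the first it only reaches it through the linker's /LTCG pass.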

Solution

I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize with, since it operates on a language tree, whereas the linker has to content itself with the object output, which is far less expressive than the code the compiler has seen. Hence less effort is spent by the linker and compiler development teams on linker optimizations that could, in theory, match the tricks the compiler does.

BTW, I'm sorry I steered your original question into the LTCG discussion. I now understand your question was a little different, more concerned with what static optimizations are possible/available at link time vs. compile time.

OTHER TIPS

Your coworker is out of date. The technology has been available since 2003 (in the MS C++ compiler): /LTCG. Link-time code generation deals with exactly this problem. As far as I know, GCC has this feature on the radar for the next-generation compiler.

LTCG does not only optimize the code, e.g. by inlining functions across modules, but actually rearranges code to optimize cache locality and branching for a specific load; see Profile-Guided Optimizations. These options are usually reserved for Release builds, as the build can take hours to finish: it links an instrumented executable, runs a profiling load, and then links again using the profiling results. The link contains details about what exactly is optimized with LTCG:

Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.

Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.

Register Allocation – Optimizing with profile data results in better register allocation.

Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.

Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.

Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.

Conditional Branch Optimization – With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.

Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.

EH Code Separation – The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.

Memory Intrinsics – The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.

Your coworker is smarter than most of us. Even if it seems a crude approach at first, project inlining into a single .cpp file has one thing that other approaches like link-time optimization do not have and will not have for a while: reliability.

However, you asked this two years ago, and I testify that a lot has changed since then (with g++ at least). Devirtualization is a lot more reliable, for instance.
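
As a point of comparison, g++ has shipped link-time optimization via -flto since GCC 4.5. A minimal sketch of the same experiment (file and output names are assumptions):

```shell
# Separate objects, no LTO: foo() cannot be inlined into main().
g++ -O2 -c foo.cpp main.cpp
g++ -O2 foo.o main.o -o test_nolto

# Separate objects with LTO: -flto stores the compiler's intermediate
# representation in the object files, letting the link step inline
# foo() across translation units.
g++ -O2 -flto -c foo.cpp main.cpp
g++ -O2 -flto foo.o main.o -o test_lto
```

Note that -flto must be passed at both compile and link time for the cross-module optimization to happen.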

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow