Function pointer runs faster than inline function. Why?

Question 1

Oh s**t (do I need to censor swearing here?), I found it out. It was somehow related to the timing being inside the loop. When I moved it outside as following,

#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}

short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{  
    int total=0;
    auto begin=std::chrono::high_resolution_clock::now();
    for(int i=0;i<100000000;i++)
    {
        short a=toBigEndianPtr((short)i);
        total+=a;
    }
    auto end=std::chrono::high_resolution_clock::now();
    std::cout<<std::chrono::duration_cast<std::chrono::duration<double>>(end-begin).count()<<", "<<total<<std::endl;
    return 0;
}

the results are just as they should be. 0.08 seconds for inline, 0.20 seconds for pointer. Sorry for bothering you guys.

Question 2

Short answer: it isn't.

You compile with -O0, wich does not optimize (much). Without optimization, you have no saying in "fast", because unptimized code is not as fast as can be.
You take the address of toBigEndian, wich prevents inlining. inline keyword is a hint for the compiler anyway, wich it may or may not follow. You did the best to not make it follow that hint.

So, to give your measurements any meaning,

optimize your code
use two functions, doing the same thing, one that gets inlined, the other one taken the addres of

Question 3

A common mistake in measuring performance (besides forgetting to optimize) is to use the wrong tool to measure. Using std::chrono would be fine, if you were measuring the performance of your entire, 10000000 or 500000000 iterations. Instead, you are asking it to measure the call / inline of toBigEndian. A function that is all of 6 instructions. So I switched to rdtsc (read time stamp counter, i.e. clock cycles).

Allowing the compiler to really optimize everything in the loop, not cluttering it with recording the time on every tiny iteration, we have a different code sequence. Now, after compiling with g++ -O3 fp_test.cpp -o fp_test -std=c++11, I observe the desired effect. The inlined version averages around 2.15 cycles per iteration, while the function pointer takes around 7.0 cycles per iteration.

Even without using rdtsc, the difference is still quite observable. The wall clock time was 360ms for the inlined code and 1.17s for the function pointer. So one could use std::chrono in place of rdtsc in this code.

Modified code follows:

#include <iostream>
static inline uint64_t rdtsc(void)
{
  uint32_t hi, lo;
  asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
  return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
#define LOOP_COUNT 500000000
int main()
{
    uint64_t t = 0, begin=0, end=0;
    int total=0;
    begin=rdtsc();
    for(int i=0;i<LOOP_COUNT;i++)
    {
        short a=0;
        a=toBigEndianPtr((short)i);
        //a=toBigEndian((short)i);
        total+=a;   
    }
    end=rdtsc();
    t+=(end-begin);
    std::cout<<((double)t/LOOP_COUNT)<<", "<<total<<std::endl;
    return 0;
}

Question 4

First off, with -O0, you aren't running the optimizer, which means the compiler is ignoring your request to inline, as it is free to do. The cost of the two different calls ought to be nearly identical. Try with -O2.

Second, if you are only running for 0.22 seconds, weirdly variable costs involved with starting your program totally dominate the cost of running the test function. That function call is just a few instructions. If your CPU is running at 2 GHz, it ought to execute that function call in something like 20 nanoseconds, so you can see that whatever it is you're measuring, it's not the cost of running that function.

Try calling the test function in a loop, say 1,000,000 times. Make the number of loops 10x bigger until it takes > 10 seconds to run the test. Then divide the result by the number of loops for an approximation of the cost of the operation.

Question 5

With many/most self-respecting modern compilers, the code you posted will still inline the function call even when when it is called through the pointer. (Assuming the compiler makes a reasonable effort to optimize the code). The situation is just too easy to see through. In other words, the generated code can easily end up virtually the same in both cases, meaning that your test is not really useful for measuring what you are trying to measure.

If you really want to make sure the call is physically performed through the pointer, you have to make an effort to "confuse" the compiler to the point where it can't figure out the pointer value at compile time. For example, make the pointer value run-time dependent, as in

toBigEndianPtr = rand() % 1000 != 0 ? toBigEndian : NULL;

or something along these lines. You can also declare your function pointer as volatile, which will typically cause a genuine through-the-pointer call each time as well as force the compiler to re-read the pointer value from memory on each iteration.