Question

(EDIT: Let's title this, "Lessons in how measurements can go wrong." I still haven't figured out exactly what's causing the discrepancy though.)

I found a very fast integer square root function here by Mark Crowne. At least with GCC on my machine, it's clearly the fastest integer square root function I've tested (including the functions in Hacker's Delight, this page, and floor(sqrt()) from the standard library).

After cleaning up the formatting a bit, renaming a variable, and using fixed-width types, it looks like this:

static uint32_t mcrowne_isqrt(uint32_t val)
{
    uint32_t temp, root = 0;

    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }

    #define INNER_ISQRT(s)                              \
    do                                                  \
    {                                                   \
        temp = (root << (s)) + (1 << ((s) * 2 - 2));    \
        if (val >= temp)                                \
        {                                               \
            root += 1 << ((s)-1);                       \
            val -= temp;                                \
        }                                               \
    } while(0)

    INNER_ISQRT(15);
    INNER_ISQRT(14);
    INNER_ISQRT(13);
    INNER_ISQRT(12);
    INNER_ISQRT(11);
    INNER_ISQRT(10);
    INNER_ISQRT( 9);
    INNER_ISQRT( 8);
    INNER_ISQRT( 7);
    INNER_ISQRT( 6);
    INNER_ISQRT( 5);
    INNER_ISQRT( 4);
    INNER_ISQRT( 3);
    INNER_ISQRT( 2);

    #undef INNER_ISQRT

    temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

The INNER_ISQRT macro isn't too evil, since it's local and immediately undefined after it's no longer needed. Nevertheless, I'd still like to convert it to an inline function as a matter of principle. I've read assertions in several places (including the GCC documentation) that inline functions are "just as fast" as macros, but I've had trouble converting it without a speed hit.

My current iteration looks like this (note the always_inline attribute, which I threw in for good measure):

static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root) __attribute__((always_inline));
static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << s) + (1 << ((s << 1) - 2));
    if(val >= temp)
    {
        root += 1 << (s - 1);
        val -= temp;
    }
}

//  Note that I just now changed the name to mcrowne_inline_isqrt, so people can compile my full test.
static uint32_t mcrowne_inline_isqrt(uint32_t val)
{
    uint32_t root = 0;

    if(val >= 0x40000000)
    {
        root = 0x8000; 
        val -= 0x40000000;
    }

    inner_isqrt(15, val, root);
    inner_isqrt(14, val, root);
    inner_isqrt(13, val, root);
    inner_isqrt(12, val, root);
    inner_isqrt(11, val, root);
    inner_isqrt(10, val, root);
    inner_isqrt(9, val, root);
    inner_isqrt(8, val, root);
    inner_isqrt(7, val, root);
    inner_isqrt(6, val, root);
    inner_isqrt(5, val, root);
    inner_isqrt(4, val, root);
    inner_isqrt(3, val, root);
    inner_isqrt(2, val, root);

    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

No matter what I do, the inline function is always slower than the macro. The macro version commonly times at around 2.92s for (2^28 - 1) iterations with an -O2 build, whereas the inline version commonly times at around 3.25s. EDIT: I said 2^32 - 1 iterations before, but I forgot that I had changed it. They take quite a bit longer for the full gamut.

It's possible that the compiler is just being stupid and refusing to inline it (note again the always_inline attribute!), but if so, that would make the macro version generally preferable anyway. (I tried checking the assembly to see, but it was too complicated as part of a program. The optimizer omitted everything when I tried compiling just the functions of course, and I'm having issues compiling it as a library due to noobishness with GCC.)

In short, is there a way to write this as an inline without a speed hit? (I haven't profiled, but sqrt is one of those fundamental operations that should always be made fast, since I may be using it in many other programs than just the one I'm currently interested in. Besides, I'm just curious.)

I've even tried using templates to "bake in" the constant value, but I get the feeling the other two parameters are more likely to be causing the hit (and the macro can avoid that, since it uses local variables directly)...well, either that or the compiler is stubbornly refusing to inline.


UPDATE: user1034749 below is getting the same assembly output from both functions when he puts them in separate files and compiles them. I tried his exact command line, and I'm getting the same result as him. For all intents and purposes, this question is solved.

However, I'd still like to know why my measurements are coming out differently. Obviously, my measurement code or original build process was causing things to be different. I'll post the code below. Does anyone know what the deal was? Maybe my compiler is actually inlining the whole mcrowne_isqrt() function in the loop of my main() function, but it's not inlining the entirety of the other version?

UPDATE 2 (squeezed in before testing code): Note that if I swap the order of the tests and make the inline version come first, the inline version comes out faster than the macro version by the same amount. Is this a caching issue, or is the compiler inlining one call but not the other, or what?

#include <iostream>
#include <time.h>      //  Linux high-resolution timer
#include <stdint.h>

/*  Functions go here */

timespec timespecdiff(const timespec& start, const timespec& end)
{
    timespec elapsed;
    timespec endmod = end;
    if(endmod.tv_nsec < start.tv_nsec)
    {
        endmod.tv_sec -= 1;
        endmod.tv_nsec += 1000000000;
    }

    elapsed.tv_sec = endmod.tv_sec - start.tv_sec;
    elapsed.tv_nsec = endmod.tv_nsec - start.tv_nsec;
    return elapsed;
}


int main()
{
    uint64_t inputlimit = 4294967295;
    //  Test a wide range of values
    uint64_t widestep = 16;

    timespec start, end;

    //  Time macro version:
    uint32_t sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowntime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;


    //  Time inline version:
    sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_inline_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowninlinetime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's inline sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;

    //  Results:
    std::cout << "Mark Crowne sqrt variant time:\t" << markcrowntime.tv_sec << "s, " << markcrowntime.tv_nsec << "ns" << std::endl;
    std::cout << "Mark Crowne inline sqrt variant time:\t" << markcrowninlinetime.tv_sec << "s, " << markcrowninlinetime.tv_nsec << "ns" << std::endl;
    std::cout << std::endl;
}

UPDATE 3: I still have no idea how to reliably compare the timing of different functions without the timing depending on the order of the tests. I'd greatly appreciate any tips!

However, if anyone else reading this is interested in fast sqrt implementations, I should mention: Mark Crowne's code tests faster than any other pure C/C++ version I've tried by a decent margin (despite the reliability issues with my testing), but the following SSE code seems like it might be slightly faster still for a scalar 32-bit integer sqrt. It can't be generalized to full 64-bit unsigned inputs without losing precision, though (and the first signed conversion would also have to be replaced by a load intrinsic to handle values >= 2^63):

#include <emmintrin.h>     //  SSE2 intrinsics

uint32_t sse_sqrt(uint64_t num)
{
    //  Uses 64-bit input, because SSE conversion functions treat all
    //  integers as signed (so conversion from a 32-bit value >= 2^31
    //  would be interpreted as negative).  As it stands, this function
    //  will similarly fail for values >= 2^63.
    //  It can also probably be made faster, since it generates a strange/
    //  useless movsd %xmm0,%xmm0 instruction before the sqrtsd.  It clears
    //  xmm0 first too with xorpd (seems unnecessary, but I could be wrong).
    __m128d result = _mm_setzero_pd();  //  initialize: passing an indeterminate
                                        //  value to _mm_cvtsi64_sd is UB
    __m128d num_as_sse_double = _mm_cvtsi64_sd(result, (int64_t)num);
    result = _mm_sqrt_sd(num_as_sse_double, num_as_sse_double);
    return _mm_cvttsd_si32(result);
}

Solution

I tried your code with GCC 4.5.3. I modified your second version of the code to match the first one. For example, the first version has:

(1 << ((s) * 2 - 2))

vs. the second version's

(1 << ((s << 1) - 1))

Yes, s * 2 == s << 1, but "- 2" and "- 1" are not the same.

I also replaced uint32_t with "unsigned long" (note that on my 64-bit machine, "long" is not a 32-bit type).

Then I ran:

g++ -ggdb -O2 -march=native -c -pipe inline.cpp
g++ -ggdb -O2 -march=native -c -pipe macros.cpp
objdump -d inline.o > inline.s
objdump -d macros.o > macros.s

I could have used "-S" instead of "-c" to get assembly directly, but I wanted to see the assembly without additional info.

And you know what?
The assembly is exactly the same for the first and the second version. So I think your time measurements are just wrong.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow