Is it possible to roll a significantly faster version of sqrt

https://stackoverflow.com/questions/2637700

26-09-2019
|

Question

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.

I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.

MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead though.

See also here for similar discussion on modf function.

EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?

Solution

Yes, it is possible even without trickery:

1) sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.

2) lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.

3) caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.

OTHER TIPS

There's a great comparison table here: http://assemblyrequired.crashworks.org/timing-square-root/

Long story short, SSE2's ssqrts is about 2x faster than FPU fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).

Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.

You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)

Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.

Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this still will only bring your execution time down by 2,5%. For most applications users wouldn't even notice this, except if they use a watch to measure.

How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).

Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;

you can see a description of a load of alternatives here.

The best is this snippet of magic:

double inline __declspec (naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]
    _asm fsqrt
    _asm ret 8
}

It's about 4.7x faster than the standard sqrt call with the same precision.

Here is a fast way with a look up table of only 8KB. Mistake is ~0.5% of the result. You can easily enlarge the table, thus reducing the mistake. Runs about 5 times faster than the regular sqrt()

// LUT for fast sqrt of floats. Table will be consist of 2 parts, half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11;                       // Use only 11 most sagnificant bits from the 23 of float. We can use 15 bits instead. It will produce less error but take more place in a memory. 
const int nUnusedBits   = 23 - nBitsForSQRTprecision;       // Amount of bits we will disregard
const int tableSize     = (1 << (nBitsForSQRTprecision+1)); // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize]; 
static unsigned char is_sqrttab_initialized = FALSE;        // Once initialized will be true

// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
    unsigned short i;
    float f;
    UINT32 *fi = (UINT32*)&f;

    if (is_sqrttab_initialized)
        return;

    const int halfTableSize = (tableSize>>1);
    for (i=0; i < halfTableSize; i++){
         *fi = 0;
         *fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127

         // Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
         f = sqrtf(f);
         sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);

         // Repeat the process, this time with an exponent of 1, stored as 128
         *fi = 0;
         *fi = (i << nUnusedBits) | (128 << 23);
         f = sqrtf(f);
         sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = TRUE;
}

// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
    if (n <= 0.f) 
        return 0.f;                           // On 0 or negative return 0.
    UINT32 *num = (UINT32*)&n;
    short e;                                  // Exponent
    e = (*num >> 23) - 127;                   // In 'float' the exponent is stored with 127 added.
    *num &= 0x7fffff;                         // leave only the mantissa 

    // If the exponent is odd so we have to look it up in the second half of the lookup table, so we set the high bit.
    const int halfTableSize = (tableSize>>1);
    const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01) 
        *num |= secondHalphTableIdBit;  
    e >>= 1;                                  // Divide the exponent by two (note that in C the shift operators are sign preserving for signed operands

    // Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
    *num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
    return n;
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow