SQRT vs RSQRT vs SSE _mm_rsqrt_ps Benchmark

https://stackoverflow.com/questions/15175654

16-03-2022

Question

I could not find a clear benchmark on this subject, so I made one. I am posting it here in case anyone else is looking for the same thing.

I have one question, though. Isn't SSE supposed to be 4 times faster than four FPU RSQRT calls in a loop? It is faster, but only by about 1.5 times. Does moving data into SSE registers cost this much because I do very little calculation besides rsqrt? Or is it because the SSE rsqrt is much more precise? How can I find out how many iterations the SSE rsqrt performs? The two results:

4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
4 SSE align16 float[4]  RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
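
On the question of how many iterations the SSE rsqrt performs: as far as Intel's documentation describes it, `_mm_rsqrt_ps` (the RSQRTPS instruction) does not iterate at all. It returns a hardware approximation with a relative error of at most 1.5 * 2^-12, so it is likely less precise than the two-iteration scalar rsqrt used in the test, not more. Below is a minimal sketch of how one might measure that error and add one Newton-Raphson step by hand. `rsqrt_refined_ps` and `max_rel_error` are hypothetical helper names, not part of the original benchmark, and the exact error depends on the CPU.

```cpp
#include <cmath>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: refine the hardware estimate with one Newton-Raphson
// step, y1 = y0 * (1.5 - 0.5 * x * y0 * y0).
inline __m128 rsqrt_refined_ps(__m128 x)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    const __m128 y0 = _mm_rsqrt_ps(x);                      // raw hardware estimate
    const __m128 xy2 = _mm_mul_ps(_mm_mul_ps(x, y0), y0);   // x * y0^2
    return _mm_mul_ps(y0, _mm_sub_ps(three_halves, _mm_mul_ps(half, xy2)));
}

// Hypothetical helper: largest relative error over the four lanes of `approx`
// compared with 1/sqrt(x) computed in double precision.
inline double max_rel_error(__m128 x, __m128 approx)
{
    alignas(16) float xs[4];
    alignas(16) float ys[4];
    _mm_store_ps(xs, x);
    _mm_store_ps(ys, approx);
    double worst = 0.0;
    for (int k = 0; k < 4; ++k) {
        const double exact = 1.0 / std::sqrt(static_cast<double>(xs[k]));
        worst = std::fmax(worst, std::fabs(ys[k] - exact) / exact);
    }
    return worst;
}
```

Calling `max_rel_error(x, _mm_rsqrt_ps(x))` and `max_rel_error(x, rsqrt_refined_ps(x))` on the same input shows how much precision the extra step buys on a given machine.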

Edit

Compiled using MSVC 11 /GS- /Gy /fp:fast /arch:SSE2 /Ox /Oy- /GL /Oi on AMD Athlon II X2 270

The test code:

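#include <iostream>
#include <chrono>
#include <cmath>        // sqrt
#include <cstdlib>      // system
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_rsqrt_ss, _mm_rsqrt_ps, ...
#include <th/thutility.h>  // my own header; provides thutility::math::rsqrt (listed further down)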

int main(void)
{
    float i;
    //long i;
    float res;
    __declspec(align(16)) float var[4] = {0};

    auto t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
        res = sqrt(i);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 float SQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
         thutility::math::rsqrt(i, res);
         res *= i;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 float RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
         thutility::math::rsqrt(i, var[0]);
         var[0] *= i;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " <<  var[0] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
         thutility::math::rsqrt(i, var[0]);
         var[0] *= i;
         thutility::math::rsqrt(i + 1, var[1]);   // take rsqrt of the same value it is multiplied by
         var[1] *= i + 1;
         thutility::math::rsqrt(i + 2, var[2]);
         var[2] *= i + 2;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "3 align16 float[4] RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " 
        << var[0] << " - " << var[1] << " - " << var[2] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
         thutility::math::rsqrt(i, var[0]);
         var[0] *= i;
         thutility::math::rsqrt(i + 1, var[1]);   // take rsqrt of the same value it is multiplied by
         var[1] *= i + 1;
         thutility::math::rsqrt(i + 2, var[2]);
         var[2] *= i + 2;
         thutility::math::rsqrt(i + 3, var[3]);
         var[3] *= i + 3;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "4 align16 float[4] RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " 
        << var[0] << " - " << var[1] << " - " << var[2] << " - " << var[3] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        var[0] = i;
        __m128& cache = reinterpret_cast<__m128&>(var);
        __m128 mmsqrt = _mm_rsqrt_ss(cache);
        cache = _mm_mul_ss(cache, mmsqrt);
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 SSE align16 float[4]  RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()
        << "us " << var[0] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        var[0] = i;
        var[1] = i + 1;
        var[2] = i + 2;
        var[3] = i + 3;
        __m128& cache = reinterpret_cast<__m128&>(var);
        __m128 mmsqrt = _mm_rsqrt_ps(cache);
        cache = _mm_mul_ps(cache, mmsqrt);
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "4 SSE align16 float[4]  RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << " - "
        << var[1] << " - " << var[2] << " - " << var[3] << std::endl;

    system("PAUSE");
}

Results using float type:

1 float SQRT: 24996us 2236.07
1 float RSQRT: 28003us 2236.07
1 align16 float[4] RSQRT: 32004us 2236.07
3 align16 float[4] RSQRT: 51013us 2236.07 - 2236.07 - 5e+006
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
1 SSE align16 float[4]  RSQRT: 46999us 2236.07
4 SSE align16 float[4]  RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07

My conclusion is that it is not worth bothering with SSE2 unless we operate on at least 4 variables at once. (This may apply only to rsqrt, but rsqrt is an expensive calculation that includes several multiplications, so it probably applies to other calculations too.)

Also, sqrt(x) is faster than x*rsqrt(x) with two iterations, and x*rsqrt(x) with one iteration is too inaccurate for distance calculations.

So the statements I have seen on some boards that x*rsqrt(x) is faster than sqrt(x) are wrong. It is not logical, and not worth the precision loss, to use rsqrt instead of sqrt unless you directly need 1/x^(1/2).

I also tried without the SSE2 flag (in case the compiler applied SSE to the normal rsqrt loop); it gave the same results.

My RSQRT is a lightly modified version of the Quake rsqrt (the algorithm is the same).
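
The accuracy side of this claim can be checked directly. Below is a minimal sketch that measures the relative error of x*rsqrt(x) against `std::sqrt` for a chosen number of Newton-Raphson iterations. `quake_rsqrt` and `sqrt_rel_error` are hypothetical names introduced for illustration, and `std::memcpy` is swapped in for the `reinterpret_cast` of the original so the bit manipulation does not depend on type-punning behaviour.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical variant of the Quake-style rsqrt with a selectable number of
// Newton-Raphson iterations.
inline float quake_rsqrt(float number, int iterations)
{
    const float x2 = number * 0.5f;
    std::uint32_t bits;
    std::memcpy(&bits, &number, sizeof bits);   // read the float's bit pattern
    bits = 0x5f3759df - (bits >> 1);            // initial guess from the magic constant
    float y;
    std::memcpy(&y, &bits, sizeof y);
    for (int k = 0; k < iterations; ++k)
        y = y * (1.5f - x2 * y * y);            // one Newton-Raphson step
    return y;
}

// Relative error of x * rsqrt(x) compared with sqrt(x) in double precision.
inline double sqrt_rel_error(float x, int iterations)
{
    const double exact = std::sqrt(static_cast<double>(x));
    const double approx = static_cast<double>(x) * quake_rsqrt(x, iterations);
    return std::fabs(approx - exact) / exact;
}
```

Comparing `sqrt_rel_error(x, 1)` with `sqrt_rel_error(x, 2)` over the range of distances that matter gives a concrete basis for deciding whether one iteration is accurate enough.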

#include <cstdint>  // uint32_t

namespace thutility
{
    namespace math
    {
        void rsqrt(const float& number, float& res)
        {
              const float threehalfs = 1.5F;
              const float x2 = number * 0.5F;

              res = number;
              uint32_t& i = *reinterpret_cast<uint32_t *>(&res);    // evil floating point bit level hacking
              i  = 0x5f3759df - ( i >> 1 );                             // what the fuck?
              res = res * ( threehalfs - ( x2 * res * res ) );   // 1st iteration
              res = res * ( threehalfs - ( x2 * res * res ) );   // 2nd iteration, this can be removed
        }
    }
}

Solution

It's easy to get a lot of unnecessary overhead in SSE code.

If you want to ensure that your code is efficient, look at the compiler's disassembly. One thing that often kills performance (and it looks like it might affect you) is moving data between memory and SSE registers unnecessarily.

Inside your loop, you should keep all the relevant data, as well as the result, in SSE registers, rather than in a float[4].

As long as you're accessing memory, verify that the compiler generates an aligned move instruction to load the data into registers or to write it back to the array.

And check that the generated SSE instructions don't have a lot of unnecessary move instructions and other cruft in between them. Some compilers are pretty horrible at generating SSE code from intrinsics, so it pays to keep an eye on the code they generate.

Finally, consult your CPU's manual/specifications to check that it actually executes the packed instructions you use just as fast as it does the scalar ones. (I would expect modern CPUs to do so, but at least some older CPUs needed a bit of extra time for packed instructions: not four times as long as a scalar one, but enough that you could not reach a 4x speedup.)


Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow