I have not found any clear benchmark about this subject so I made one. I will post it here in case anybody is looking for this like me.
I have one question though. Isn't SSE supposed to be 4 times faster than four fpu RSQRT in a loop? It is faster but a merely 1.5 times. Is moving to SSE registers having this much impact because I do not do a lot of calculations, but only rsqrt? Or is it because SSE rsqrt is much more precise, how do I find how many iterations sse rsqrt does? The two results:
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
Edit
Compiled using MSVC 11 /GS- /Gy /fp:fast /arch:SSE2 /Ox /Oy- /GL /Oi
on AMD Athlon II X2 270
The test code:
#include <iostream>
#include <chrono>
#include <th/thutility.h>
int main(void)
{
float i;
//long i;
float res;
__declspec(align(16)) float var[4] = {0};
auto t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
res = sqrt(i);
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 float SQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, res);
res *= i;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 float RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, var[0]);
var[0] *= i;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, var[0]);
var[0] *= i;
thutility::math::rsqrt(i, var[1]);
var[1] *= i + 1;
thutility::math::rsqrt(i, var[2]);
var[2] *= i + 2;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "3 align16 float[4] RSQRT: "
<< std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us "
<< var[0] << " - " << var[1] << " - " << var[2] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
thutility::math::rsqrt(i, var[0]);
var[0] *= i;
thutility::math::rsqrt(i, var[1]);
var[1] *= i + 1;
thutility::math::rsqrt(i, var[2]);
var[2] *= i + 2;
thutility::math::rsqrt(i, var[3]);
var[3] *= i + 3;
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "4 align16 float[4] RSQRT: "
<< std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us "
<< var[0] << " - " << var[1] << " - " << var[2] << " - " << var[3] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
var[0] = i;
__m128& cache = reinterpret_cast<__m128&>(var);
__m128 mmsqrt = _mm_rsqrt_ss(cache);
cache = _mm_mul_ss(cache, mmsqrt);
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "1 SSE align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()
<< "us " << var[0] << std::endl;
t1 = std::chrono::high_resolution_clock::now();
for(i = 0; i < 5000000; i+=1)
{
var[0] = i;
var[1] = i + 1;
var[2] = i + 2;
var[3] = i + 3;
__m128& cache = reinterpret_cast<__m128&>(var);
__m128 mmsqrt = _mm_rsqrt_ps(cache);
cache = _mm_mul_ps(cache, mmsqrt);
}
t2 = std::chrono::high_resolution_clock::now();
std::cout << "4 SSE align16 float[4] RSQRT: "
<< std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << " - "
<< var[1] << " - " << var[2] << " - " << var[3] << std::endl;
system("PAUSE");
}
Results using float type:
1 float SQRT: 24996us 2236.07
1 float RSQRT: 28003us 2236.07
1 align16 float[4] RSQRT: 32004us 2236.07
3 align16 float[4] RSQRT: 51013us 2236.07 - 2236.07 - 5e+006
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
1 SSE align16 float[4] RSQRT: 46999us 2236.07
4 SSE align16 float[4] RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07
My conclusion is not it is not worth bothering with SSE2 unless we make calculations on no less than 4 variables. (Maybe this applies to only rsqrt here but it is an expensive calculation (it also includes multiple multiplications) so it probably applies to other calculations too)
Also sqrt(x) is faster than x*rsqrt(x) with two iterations, and x*rsqrt(x) with one iteration is too inaccurate for distance calculation.
So the statements that I have seen on some boards that x*rsqrt(x) is faster than sqrt(x) is wrong. So it is not logical and does not worth the precision loss to use rsqrt instead of sqrt unless you directly need 1/x^(1/2).
Tried with no SSE2 flag (in case it applied SSE on normal rsqrt loop, it gave same results).
My RSQRT is a modified (same) version of quake rsqrt.
namespace thutility
{
namespace math
{
void rsqrt(const float& number, float& res)
{
const float threehalfs = 1.5F;
const float x2 = number * 0.5F;
res = number;
uint32_t& i = *reinterpret_cast<uint32_t *>(&res); // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
res = res * ( threehalfs - ( x2 * res * res ) ); // 1st iteration
res = res * ( threehalfs - ( x2 * res * res ) ); // 2nd iteration, this can be removed
}
}
}