This is not an answer to the original question; rather, I am trying to resolve the discrepancies between two other answers, and it wouldn't fit into a comment.
The trigonometric approach is 4x slower than your original version with the square-root function on my machine (Linux, Intel Core i5). Your mileage will vary.
The asm (""); is always a bad smell, together with its siblings volatile and (void) x. Running a tight loop many, many times is a very unreliable way of benchmarking.

What to do instead?
Analyze the generated assembly code to see what the compiler actually did to your source code.
Use a profiler. I can recommend perf or Intel VTune.
If you look at the assembly code of your micro-benchmark, you will see that the compiler is very smart: it figured out that v1 and v2 do not change and eliminated as much work as it could at compile time. At runtime, no calls were made to sqrtf, acosf, or cosf. That explains why you did not see any difference between the two approaches.
Here is an edited version of your benchmark. I scrambled it a bit and guarded against division by zero with 1.0e-6f. (It doesn't change the conclusions.)
#include <stdio.h>
#include <math.h>
#ifdef USE_NORMALIZE
#warning "Using normalize"
void mid_v3_v3v3_slerp(float res[3], const float v1[3], const float v2[3])
{
    float m;
    float v[3] = { (v1[0] + v2[0]), (v1[1] + v2[1]), (v1[2] + v2[2]) };

    m = 1.0f / sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2] + 1.0e-6f);

    v[0] *= m;
    v[1] *= m;
    v[2] *= m;

    res[0] = v[0];
    res[1] = v[1];
    res[2] = v[2];
}
#else
#warning "Not using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
    const float dot_product = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    const float theta = acosf(dot_product);
    const float n = 1.0f / (2.0f * cosf(theta * 0.5f) + 1.0e-6f);

    v[0] = (v1[0] + v2[0]) * n;
    v[1] = (v1[1] + v2[1]) * n;
    v[2] = (v1[2] + v2[2]) * n;
}
#endif
int main(void)
{
    unsigned long long int i = 20000000;
    float v1[3] = { -0.8659117221832275, 0.4995948076248169, 0.024538060650229454 };
    float v2[3] = { 0.7000154256820679, 0.7031427621841431, -0.12477479875087738 };
    float v[3] = { 0.0, 0.0, 0.0 };

    while (--i) {
        mid_v3_v3v3_slerp( v, v1, v2);
        mid_v3_v3v3_slerp(v1,  v, v2);
        mid_v3_v3v3_slerp(v1, v2,  v);
    }

    printf("done %f %f %f\n", v[0], v[1], v[2]);
    return 0;
}
I compiled it with gcc -ggdb3 -O3 -Wall -Wextra -fwhole-program -DUSE_NORMALIZE -march=native -static normal.c -lm and profiled the code with perf.
The trigonometric approach is 4x slower, and it is because of the expensive cosf and acosf functions.
I have tested the Intel C++ Compiler as well: icc -Ofast -Wall -Wextra -ip -xHost normal.c; the conclusion is the same, although gcc generates approximately 10% slower code (with -Ofast as well).
I wouldn't even try to implement an approximate sqrtf: it is already an intrinsic, and chances are your approximation will only be slower...
Having said all this, I don't know the answer to the original question. I thought about it, and I also suspect that there might be another way that doesn't involve the square-root function.
Interesting question in theory; in practice, I doubt that getting rid of the square root would make any measurable difference in your application's speed.