This is not an answer to the original question; rather, I am trying to resolve the discrepancies between two other answers, and it wouldn't fit into a comment.
The trigonometric approach is 4x slower than your original version with the square-root function on my machine (Linux, Intel Core i5). Your mileage will vary.
The asm (""); is always a bad smell, together with its siblings volatile and (void) x. Running a tight loop many, many times is a very unreliable way of benchmarking.

What to do instead?
Analyze the generated assembly code to see what the compiler actually did to your source code.
Use a profiler. I can recommend perf or Intel VTune.
If you look at the assembly code of your micro-benchmark, you will see that the compiler is very smart: it figured out that v1 and v2 do not change and eliminated as much work as it could at compile time. At runtime, no calls were made to sqrtf, acosf, or cosf. That explains why you did not see any difference between the two approaches.
Here is an edited version of your benchmark. I scrambled it a bit and guarded against division by zero with 1.0e-6f. (It doesn't change the conclusions.)
#include <stdio.h>
#include <math.h>
#ifdef USE_NORMALIZE
#warning "Using normalize"
void mid_v3_v3v3_slerp(float res[3], const float v1[3], const float v2[3])
{
    float m;
    float v[3] = { (v1[0] + v2[0]), (v1[1] + v2[1]), (v1[2] + v2[2]) };

    m = 1.0f / sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2] + 1.0e-6f);

    v[0] *= m;
    v[1] *= m;
    v[2] *= m;

    res[0] = v[0];
    res[1] = v[1];
    res[2] = v[2];
}
#else
#warning "Not using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
    const float dot_product = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    const float theta = acosf(dot_product);
    const float n = 1.0f / (2.0f * cosf(theta * 0.5f) + 1.0e-6f);

    v[0] = (v1[0] + v2[0]) * n;
    v[1] = (v1[1] + v2[1]) * n;
    v[2] = (v1[2] + v2[2]) * n;
}
#endif
int main(void)
{
    unsigned long long int i = 20000000;
    float v1[3] = { -0.8659117221832275, 0.4995948076248169, 0.024538060650229454 };
    float v2[3] = { 0.7000154256820679, 0.7031427621841431, -0.12477479875087738 };
    float v[3] = { 0.0, 0.0, 0.0 };

    while (--i) {
        mid_v3_v3v3_slerp( v, v1, v2);
        mid_v3_v3v3_slerp(v1,  v, v2);
        mid_v3_v3v3_slerp(v1, v2,  v);
    }

    printf("done %f %f %f\n", v[0], v[1], v[2]);
    return 0;
}
I compiled it with gcc -ggdb3 -O3 -Wall -Wextra -fwhole-program -DUSE_NORMALIZE -march=native -static normal.c -lm and profiled the code with perf.
The trigonometric approach is 4x slower, and it is because of the expensive cosf and acosf functions.
I have tested the Intel C++ Compiler as well: icc -Ofast -Wall -Wextra -ip -xHost normal.c; the conclusion is the same, although gcc generates approximately 10% slower code (with -Ofast as well).
I wouldn't even try to implement an approximate sqrtf: it is already an intrinsic, and chances are your approximation will only be slower...
Having said all this, I don't know the answer to the original question. I thought about it, and I also suspect that there might be another way that doesn't involve the square-root function.
Interesting question in theory; in practice, I doubt that getting rid of the square root would make any measurable difference in your application's speed.