Question

I'm consistently finding that the long double datatype is about twice as fast as double for my calculations when compiling with -funsafe-math-optimizations. I would like some insight into this, because the 80-bit format has long been deprecated, or I might be doing something really dumb with the double datatype. The compiler is g++ 4.8.2, the target is x86_64 (so gcc prefers SSE2 unless I use long double).

My code is more or less like this (simplified):

// x is an array of floating point numbers (double or long double)
for (size_t i = 0; i < x.size(); ++i) {
    Real accumulator = 0;
    for (int k = 0; k < kmax; ++k)
        accumulator += A[k]*(B[k]*cos(C*k*x[i]) - D[k]*sin(C*k*x[i]));
    x[i] += F*accumulator;
    // wrap x[i] back into [-1/2, 1/2)
    if (x[i] >= 0.5)      x[i] -= std::trunc(x[i] + 0.5);
    else if (x[i] < -0.5) x[i] -= std::trunc(x[i] - 0.5);
}

A, B, ... are some precomputed arrays/constants; Real stands for the floating-point type I'm comparing (double or long double).

The speedup seems unrelated to cache-line effects, because I get the same relative speedup when I parallelize the outer loop with OpenMP.

EDIT: I corrected the pseudocode: note that cos and sin have the same argument, which in the end is the reason for the speedup (see gsg's answer and the comments).
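
(For context, the contraction of the shared-argument sin/cos pair is roughly equivalent to computing each term with a single sincos call, as in the sketch below. sincos is a GNU libm extension, and the function and parameter names here are just placeholders.)

#include <math.h>   // glibc declares sincos() here; g++ defines _GNU_SOURCE by default

// Hypothetical helper for one term of the sum: evaluate sin and cos of
// the shared argument with one sincos() call instead of two separate calls.
static double term(double A, double B, double D, double arg)
{
    double s, c;
    sincos(arg, &s, &c);          // s = sin(arg), c = cos(arg) in a single libm call
    return A * (B * c - D * s);
}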


Solution

My guess is that the difference is due to cos.

The long double math must be compiled into x87 instructions, which makes it natural and efficient to use the x87 operation fcos. However, there are no transcendental instructions for the xmm registers, so a cos call on a double must either move the value onto the x87 stack and invoke fcos, or call a library function that does the equivalent work. These are, presumably, more expensive for this compiler and machine.
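
For illustration, a pair of one-line wrappers like these (hypothetical names) makes the difference easy to spot in the generated assembly:

#include <cmath>

// double version: on x86_64 this typically ends in a call to the libm cos
double f64(double x)           { return std::cos(x); }

// long double version: this goes through the x87 unit and, with
// -funsafe-math-optimizations, can be emitted as an inline fcos
long double f80(long double x) { return std::cos(x); }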

You could try to verify this by looking at the assembly - look for call cos or x87 instructions - and it might also be worth compiling with -mfpmath=387 to see if the performance characteristics change.
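
As a rough cross-check, something along the lines of the sketch below (the sizes and constants are arbitrary) can be built twice, e.g. with g++ -O2 -funsafe-math-optimizations and then with -mfpmath=387 added, and timed for both types:

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Time a stripped-down version of the kernel for a given floating-point type.
template <typename Real>
double run(std::size_t n, int kmax)
{
    std::vector<Real> x(n, Real(0.1));
    const Real C = Real(3.14159), F = Real(1e-3);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        Real acc = 0;
        for (int k = 0; k < kmax; ++k)
            acc += std::cos(C * k * x[i]) - std::sin(C * k * x[i]);
        x[i] += F * acc;
    }
    auto t1 = std::chrono::steady_clock::now();

    std::printf("check: %Lg\n", (long double)x[n / 2]);   // keep the work observable
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    std::printf("double:      %.3f s\n", run<double>(20000, 200));
    std::printf("long double: %.3f s\n", run<long double>(20000, 200));
}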

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow