My guess is that the difference is due to `cos`.

The `long double` math must be compiled into x87 instructions, making it easy and efficient to use the x87 operation `fcos`. However, there are no transcendental operations for the `xmm` registers, so a call to `cos` must either generate code to move a `double` onto the x87 stack and invoke `fcos`, or make a function call to do the equivalent work. Either option is, presumably, more expensive for this compiler and machine.
You could try to verify this by looking at the assembly (look for `call cos` or x87 instructions), and it might also be worth compiling with `-mfpmath=387` to see if the performance characteristics change.