According to Intel® Intrinsics Guide _mm_mul_ps
, _mm_add_ps
, _mm_sub_ps
have Throughput=1
for your CPUID 06_17
(as you noted).
In different sources I saw different throughput meanings. In some places it was clock/instruction
, in others it was the inverse (of course, while we have 1
- it does not matter).
According to "Intel® 64 and IA-32 Architectures Optimization Reference Manual" definition of Throughput
is:
Throughput
— The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.
According to "C.3.2 Table Footnotes":
— The FP_ADD unit handles x87 and SIMD floating-point add and subtract operation.
— The FP_MUL unit handles x87 and SIMD floating-point multiply operation.
I.e. additions/subtractions and multiplications are executed on different execution units.
FP_ADD
and FP_MUL
execution units are connected to different Dispatch Ports (at bottom of picture):
The scheduler can dispatch instructions to several ports every cycle.
Multiplication and addition execution units can perform operations in parallel. So theoretical GFLOPS on one core of your processor is:
sse_packet_size = 4
instructions_per_cycle = 2
clock_rate_ghz = 2.66
sse_packet_size * instructions_per_cycle * clock_rate_ghz = 21.28GFLOPS
So, you are closely approaching the theoretical peak with your 18.4GFLOPS.
_Mandelbrot
function has 3 instructions for FP_ADD and 3 for FP_MUL. As you can see within function there are many data-dependencies, so instructions cannot be interleaved efficiently. I.e, in order to feed FP_ADD with some operations, FP_MUL should execute at least two operations in order to produce operands required for FP_ADD.
But hopefully, your inner for
loop has many operations without dependencies:
for (int j = 0; j < 1000000; ++j)
{
_Mandelbrot(x6_Re, x6_Im, x1_Re, x1_Im, c_Re, c_Im); // 1
_Mandelbrot(x1_Re, x1_Im, x2_Re, x2_Im, c_Re, c_Im); // 2
_Mandelbrot(x2_Re, x2_Im, x3_Re, x3_Im, c_Re, c_Im); // 3
_Mandelbrot(x3_Re, x3_Im, x4_Re, x4_Im, c_Re, c_Im); // 4
_Mandelbrot(x4_Re, x4_Im, x5_Re, x5_Im, c_Re, c_Im); // 5
_Mandelbrot(x5_Re, x5_Im, x6_Re, x6_Im, c_Re, c_Im); // 6
}
Only sixth operation depends on output of first. Instructions of all other operations can be interleaved freely with each other (by both - compiler and processor), which would allow to keep busy both FP_ADD
and FP_MUL
units.
P.S. Just for test, you can try to replace all add
/sub
operations with mul
in Mandelbrot
function or vice versa - and you will get only ~half of current FLOPS.