Over theoretical peak FLOPS benchmark

Question

According to Intel® Intrinsics Guide _mm_mul_ps, _mm_add_ps, _mm_sub_ps have Throughput=1 for your CPUID 06_17 (as you noted).

In different sources I saw different throughput meanings. In some places it was clock/instruction, in others it was the inverse (of course, while we have 1 - it does not matter).

According to "Intel® 64 and IA-32 Architectures Optimization Reference Manual" definition of Throughput is:

Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.

According to "C.3.2 Table Footnotes":

— The FP_ADD unit handles x87 and SIMD floating-point add and subtract operation.

— The FP_MUL unit handles x87 and SIMD floating-point multiply operation.

I.e. additions/subtractions and multiplications are executed on different execution units.

FP_ADD and FP_MUL execution units are connected to different Dispatch Ports (at bottom of picture):

Intel Core microarchitecture (wikipedia)

The scheduler can dispatch instructions to several ports every cycle.

Multiplication and addition execution units can perform operations in parallel. So theoretical GFLOPS on one core of your processor is:

sse_packet_size = 4
instructions_per_cycle = 2
clock_rate_ghz = 2.66
sse_packet_size * instructions_per_cycle * clock_rate_ghz = 21.28GFLOPS

So, you are closely approaching the theoretical peak with your 18.4GFLOPS.

_Mandelbrot function has 3 instructions for FP_ADD and 3 for FP_MUL. As you can see within function there are many data-dependencies, so instructions cannot be interleaved efficiently. I.e, in order to feed FP_ADD with some operations, FP_MUL should execute at least two operations in order to produce operands required for FP_ADD.

But hopefully, your inner for loop has many operations without dependencies:

for (int j = 0; j < 1000000; ++j)
{
    _Mandelbrot(x6_Re, x6_Im, x1_Re, x1_Im, c_Re, c_Im); // 1
    _Mandelbrot(x1_Re, x1_Im, x2_Re, x2_Im, c_Re, c_Im); // 2
    _Mandelbrot(x2_Re, x2_Im, x3_Re, x3_Im, c_Re, c_Im); // 3
    _Mandelbrot(x3_Re, x3_Im, x4_Re, x4_Im, c_Re, c_Im); // 4
    _Mandelbrot(x4_Re, x4_Im, x5_Re, x5_Im, c_Re, c_Im); // 5
    _Mandelbrot(x5_Re, x5_Im, x6_Re, x6_Im, c_Re, c_Im); // 6
}

Only sixth operation depends on output of first. Instructions of all other operations can be interleaved freely with each other (by both - compiler and processor), which would allow to keep busy both FP_ADD and FP_MUL units.

P.S. Just for test, you can try to replace all add/sub operations with mul in Mandelbrot function or vice versa - and you will get only ~half of current FLOPS.