GT200 Single Precision Peak Performance

Question

Latency and throughput are not the same thing.

A cc 1.x SM can retire 8 single precision floating point MAD operations on every clock cycle.

This is the correct formula:

1296 MHz(cycle/s) * 30 SM * (8 SP/SM  * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)

= 622080 Mflop/s + 311040 Mflop/s = 933 GFlop/s single precision

From here

EDIT: The 4-cycle latency you're referring to is the latency of a warp (i.e. 32 threads) MAD instruction, as issued to the SM, not the latency of a single MAD operation on a single SP. The FPU in each SP can generate one MAD result per clock, and there are 8 SP's in one SM, so each SM can generate 8 MAD results per clock. Since a warp (32 threads) MAD instruction requires 32 MAD results, it requires 4 total clocks to complete the warp instruction, as issued to the SPs in the SM.

The FPU in the SP can generate one new MAD result per clock. From the standpoint of instruction issue, the fundamental unit is the warp. Therefore a warp MAD instruction requires 4 clocks to complete.

EDIT2: Responding to question below.

Preface: The FPUs in the SFU are not independently schedulable. They only come into play when an instruction is scheduled to the SFUs. There are 4 FPU per SFU, and an SFU instruction requires 16 cycles (since there are 2 SFU/SM) to complete for a warp. If all 4 FPU in both SFUs were fully utilized, that would be 128 (16x4x2) flops produced during the computation of the SFU instruction, in 16 cycles. This is added to the 256 (16x2x8) total flops that could be generated by the "regular" MAD FPUs in the SM during the same time (16 cycles).

Your question seems to be interpreting the observed benchmark result and this statement in the text:

Table III also shows that the throughput for single-precision floating point multiplication is 11.2 ops/clock, which means that multiplication can be issued to both the SP and SFU units. This suggests that each SFU unit is capable of doing 2 multiplications per cycle, twice the throughput of other (more complex) instructions that map to this unit.

as an indication of either the throughput of the FPUs in the SFU or else the number of FPUs in the SFU. However you are conflating benchmark data with a theoretical number. The SFU has 4 FPU, but this does not mean that all 4 are independently schedulable for arbitrary arithmetic or instruction streams. Seeing all 4 FPU take on a new floating point instruction in a given cycle may require a specific instruction sequence that the authors haven't used.