I was trying to verify the single precision peak performance of a reference GT200 card.

From http://www.realworldtech.com/gt200/9/, we have two facts about the GT200:

  1. The latency of the fastest operation on an SP core is 4 cycles.
  2. An SFU also takes 4 cycles to finish an operation.

Now, each SM has a total of 8 SPs and 2 SFUs, with each SFU containing 4 FP multiply units. The SPs and SFUs can work at the same time because they sit on two different ports, as shown in the SM-level diagrams. Each SP can perform a MAD operation.

So, we are looking at 8 MAD operations and 8 MUL operations per 4 SP cycles. Since a MAD counts as 2 operations, that gives us 16 + 8 = 24 operations per 4 SP clock cycles. Since 2 SP clock cycles count as one shader clock, we have 24/2 = 12 operations per shader clock. For a reference GT200 card, the shader clock is 1296 MHz.

Thus, the single precision peak performance must be 1296 MHz * 30 SMs * 12 operations per shader clock = 466.56 GFLOPS.
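The arithmetic above can be checked with a short script (a sketch; every figure is taken from the text, not from vendor documentation):

```python
# Reproducing the operation count in the question; all figures come from the text.
shader_clock_hz = 1296e6              # reference GT200 shader clock
num_sm = 30                           # SMs on a reference GT200

ops_per_4_sp_cycles = 8 * 2 + 8       # 8 MADs (2 flops each) + 8 SFU MULs = 24
# The question assumes 2 SP cycles = 1 shader clock, so 4 SP cycles = 2 shader clocks.
ops_per_shader_clock = ops_per_4_sp_cycles / 2
gflops = shader_clock_hz * num_sm * ops_per_shader_clock / 1e9
print(gflops)                         # 466.56 -- half of the quoted spec
```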

This is exactly half of the GFLOPS as reported in the specs. So where am I going wrong?

Edit: After Robert’s pointer to the CUDA Programming Guide, which says 8 MADs/shader clock can be performed in a GT200 SM, I have to question how latency and throughput relate to each other in this particular SM.

There is a latency of 4 SP cycles per operation (as pointed out earlier), thus one MAD every 4 SP cycles, right? With 8 SPs, that becomes 8 MADs every 4 SP cycles per SM.

Since 2 SP cycles form one shader cycle, we are left with 8 MADs per 2 shader clock cycles, i.e. 4 MADs per shader clock.

This doesn’t match the 8 MADs/shader clock from the Programming Guide. So what am I doing wrong again?


Solution

Latency and throughput are not the same thing.

A cc 1.x SM can retire 8 single-precision floating-point MAD operations on every clock cycle.

This is the correct formula:

1296 MHz(cycle/s) * 30 SM * (8 SP/SM  * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)

= 622080 Mflop/s + 311040 Mflop/s = 933.12 GFlop/s single precision
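As a sanity check, the formula can be evaluated directly (a sketch using the same constants as the formula above):

```python
# Peak single-precision rate for a reference GT200, per the formula above.
shader_clock_hz = 1296e6          # reference GT200 shader clock
num_sm = 30
sp_flops_per_cycle = 8 * 2        # 8 SPs x 2 flops/cycle (one MAD each)
sfu_flops_per_cycle = 2 * 4 * 1   # 2 SFUs x 4 FPUs x 1 flop/cycle (MUL)
gflops = shader_clock_hz * num_sm * (sp_flops_per_cycle + sfu_flops_per_cycle) / 1e9
print(gflops)                     # 933.12
```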

From here

EDIT: The 4-cycle latency you're referring to is the latency of a warp (i.e. 32 threads) MAD instruction, as issued to the SM, not the latency of a single MAD operation on a single SP. The FPU in each SP can generate one MAD result per clock, and there are 8 SP's in one SM, so each SM can generate 8 MAD results per clock. Since a warp (32 threads) MAD instruction requires 32 MAD results, it requires 4 total clocks to complete the warp instruction, as issued to the SPs in the SM.

The FPU in the SP can generate one new MAD result per clock. From the standpoint of instruction issue, the fundamental unit is the warp. Therefore a warp MAD instruction requires 4 clocks to complete.
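One way to see how a 4-cycle latency coexists with one result per clock is a toy pipeline model (purely illustrative; `cycles_to_finish` is a hypothetical helper, not the actual GT200 scheduler or any CUDA API):

```python
# Toy model of a pipelined unit: results retire 'latency' cycles after issue,
# but a new independent operation can be issued every cycle.
def cycles_to_finish(num_ops, latency=4):
    """Cycles until the last of num_ops independent ops retires."""
    if num_ops == 0:
        return 0
    # One op issues per cycle; the last op issues at cycle num_ops - 1
    # and retires 'latency' cycles later.
    return (num_ops - 1) + latency

print(cycles_to_finish(1))    # 4   -> latency dominates a single op
print(cycles_to_finish(100))  # 103 -> ~1 result/cycle in steady state
```

So for a long stream of independent operations the unit sustains one result per clock even though any individual operation takes 4 cycles from issue to retirement.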

EDIT2: Responding to question below.

Preface: The FPUs in the SFU are not independently schedulable. They only come into play when an instruction is scheduled to the SFUs. There are 4 FPUs per SFU, and an SFU warp instruction requires 16 cycles to complete (32 threads across the 2 SFUs/SM). If all 4 FPUs in both SFUs were fully utilized, that would be 128 (16x4x2) flops produced during the computation of the SFU instruction, in those 16 cycles. This is added to the 256 (16x2x8) total flops that could be generated by the "regular" MAD FPUs in the SM during the same 16 cycles.
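The 16-cycle accounting can be tallied explicitly (same counts as the paragraph above):

```python
# Flop accounting over one 16-cycle SFU warp instruction, per the text.
cycles = 16
sfu_flops = cycles * 4 * 2   # 4 FPUs/SFU x 2 SFUs      = 128
mad_flops = cycles * 2 * 8   # 2 flops/MAD x 8 SPs      = 256
print(sfu_flops, mad_flops)              # 128 256
print((sfu_flops + mad_flops) / cycles)  # 24.0 flops/cycle per SM
```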

Your question seems to be interpreting the observed benchmark result and this statement in the text:

Table III also shows that the throughput for single-precision floating point multiplication is 11.2 ops/clock, which means that multiplication can be issued to both the SP and SFU units. This suggests that each SFU unit is capable of doing 2 multiplications per cycle, twice the throughput of other (more complex) instructions that map to this unit.

as an indication of either the throughput of the FPUs in the SFU or the number of FPUs in the SFU. However, you are conflating benchmark data with a theoretical number. The SFU has 4 FPUs, but this does not mean all 4 are independently schedulable for arbitrary arithmetic or instruction streams. Seeing all 4 FPUs take on a new floating-point instruction in a given cycle may require a specific instruction sequence that the authors haven't used.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow