[Since you also asked the same question on the NVIDIA forums, http://devtalk.nvidia.com, I simply copied the answer I gave there to StackOverflow. In general, cross-references are helpful when questions are asked on multiple platforms.]
Latency is fairly meaningless on a throughput-oriented architecture like the GPU, since the hardware hides instruction latency by switching among many resident threads. The easiest way to determine throughput numbers for whatever operation you are interested in is to measure it on the device you plan to target; as far as I know, this is also how the tables in the CPU document you referenced were generated.
You can examine the machine code (SASS) for the modulo operation by disassembling it with `cuobjdump --dump-sass`. When I do this for sm_20, I count sixteen instructions in total for a 32/32->32-bit unsigned modulo. From the instruction mix, I would estimate the throughput at around 20 billion operations per second on a Tesla C2050, across the entire GPU (note that this is a guesstimate, not a measured number!).
As for the 64/64->64-bit unsigned modulo, which is implemented as a called subroutine, I recently measured a throughput of 6.4 billion operations per second on a C2050 with CUDA 5.0.
Instead of using division-based modulo, you might want to look into the algorithms of Montgomery and Barrett for modular multiplication, which avoid the division altogether.