[Since you also asked the same question on the NVIDIA forums, http://devtalk.nvidia.com, I simply copied the answer I gave there to StackOverflow. In general, cross-references are helpful when questions are asked on multiple platforms.]
Latency is fairly meaningless on a throughput-oriented architecture like the GPU, since the hardware hides instruction latency by switching among many resident threads. The easiest way to determine throughput numbers for whatever operation you are interested in is to measure it on the device you plan to target; as far as I know, this is also how the tables in the CPU document you referenced were generated.
You can examine the machine code (SASS) for the modulo operation by disassembling it with `cuobjdump --dump-sass`. When I do this for sm_20, I count sixteen instructions in total for a 32/32->32-bit unsigned modulo. From the instruction mix, I would estimate the throughput at around 20 billion operations per second on a Tesla C2050, across the entire GPU (note that this is a guesstimate, not a measured number!).
As for the 64/64->64-bit unsigned modulo, which is implemented as a called subroutine, I recently measured a throughput of 6.4 billion operations per second on a C2050 with CUDA 5.0.
Instead of using division-based modulo, you might want to look into the algorithms of Montgomery and Barrett for modular multiplication, which avoid the division altogether.