Question

I have a CUDA program that I am running on a GTX 680. While testing different compiler options I noticed that:

  • compiling my code for compute capability 1.0 and sm 1.0 gives a runtime of 47 ms

  • compiling my code for compute capability 3.5 (also 2.0) and sm 3.0 gives a runtime of 60 ms


What might be the reasons for such results?

I am compiling with the Nsight compiler on Linux with CUDA 5.0, and my kernel is mostly memory bound.

Thanks.


The commands:

cc 1.0

nvcc --compile -O0 -Xptxas -v -gencode arch=compute_10,code=compute_10 -gencode arch=compute_10,code=sm_10 -keep -keep-dir /tmp/debug -lineinfo -pg -v  -x cu -o  "BenOlaCuda/src/main.o" "../BenOlaCuda/src/main.cu"

cc 3.0

nvcc -lineinfo -pg -O0 -v -keep -keep-dir /tmp/debug -Xptxas -v -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -odir "BenOlaCuda/src" -M -o "BenOlaCuda/src/main.d" "../BenOlaCuda/src/main.cu"

Some more info from compiling my kernel:

cc 1.0

ptxas info    : Compiling entry function '_Z15optimizePixelZ3tfPfS_S_S_tttttt' for 'sm_10'
ptxas info    : Used 40 registers, 68 bytes smem, 64 bytes cmem[1], 68 bytes lmem

cc 3.0

ptxas info    : Compiling entry function '_Z15optimizePixelZ3tfPfS_S_S_tttttt' for 'sm_30'
ptxas info    : Function properties for _Z15optimizePixelZ3tfPfS_S_S_tttttt
128 bytes stack frame, 100 bytes spill stores, 108 bytes spill loads
ptxas info    : Used 63 registers, 380 bytes cmem[0], 20 bytes cmem[2]

Solution

About two years ago I switched my simulation from CUDA 3.2 to CUDA 4.0 and experienced a performance hit of about 10%. With compute capability 2.0, NVIDIA introduced IEEE 754-2008 conformant floating-point calculation (CC 1.0 used IEEE 754-1985). This, together with the removal of "flush to zero" as the default, was the reason for the performance hit. Try compiling your CC 3.0 executable with the compiler flag --use_fast_math, which brings back the old, lower precision of CC 1.0.
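
To check whether the flag makes a difference for a given kernel, one option is to time the same kernel in two builds. The sketch below is only an illustration: scaleKernel is a hypothetical stand-in for the real kernel, and the file and output names are made up. As far as the nvcc documentation goes, --use_fast_math implies -ftz=true -prec-div=false -prec-sqrt=false -fmad=true, so those flags can also be tried one at a time.

```cuda
// Build twice and compare the printed times (file names are placeholders):
//   nvcc -O3 -arch=sm_30 timing.cu -o timing_precise
//   nvcc -O3 -arch=sm_30 --use_fast_math timing.cu -o timing_fast
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: division and square root are among the
// operations whose precision (and cost) the fast-math flags change.
__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(in[i]) / (in[i] + 1.0f);
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *in = NULL, *out = NULL;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup costs are not part of the measurement.
    scaleKernel<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    scaleKernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that the commands in the question use -O0 and -pg, so it may be worth repeating the comparison with an optimized build as well.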

OTHER TIPS

Note that the GTX 680 cannot run SM 3.5 code, only 3.0. SM 3.5 requires a different chip, such as the one in the GTX Titan.
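
The compatibility rule behind this can be written down as a small host-side check. This is a sketch of the general rule as NVIDIA documents it (binaries are tied to a major version, PTX can be JIT-compiled forward); the function names canRunCubin and canJitPtx are made up for illustration.

```cuda
#include <cstdio>

// A binary (cubin) built for sm_XY runs only on devices with the same
// major version X and a minor version of at least Y.
bool canRunCubin(int devMajor, int devMinor, int binMajor, int binMinor)
{
    return devMajor == binMajor && devMinor >= binMinor;
}

// PTX built for compute_XY can be JIT-compiled by the driver for any
// device whose compute capability is at least X.Y.
bool canJitPtx(int devMajor, int devMinor, int ptxMajor, int ptxMinor)
{
    return devMajor > ptxMajor ||
           (devMajor == ptxMajor && devMinor >= ptxMinor);
}

int main()
{
    // The GTX 680 from the question has compute capability 3.0.
    const int major = 3, minor = 0;

    printf("sm_30 binary runs:     %s\n", canRunCubin(major, minor, 3, 0) ? "yes" : "no");
    printf("sm_35 binary runs:     %s\n", canRunCubin(major, minor, 3, 5) ? "yes" : "no");
    printf("sm_10 binary runs:     %s\n", canRunCubin(major, minor, 1, 0) ? "yes" : "no");
    printf("compute_10 PTX usable: %s\n", canJitPtx(major, minor, 1, 0) ? "yes" : "no");
    printf("compute_35 PTX usable: %s\n", canJitPtx(major, minor, 3, 5) ? "yes" : "no");
    return 0;
}
```

For a 3.0 device only the sm_30 binary and the compute_10 PTX come out as usable, which is why the sm_35 part of the build is ignored on the GTX 680.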

I have two quite different explanations for the difference you are seeing:

  1. No GPU code is actually executed. This can happen if you compile only "GPU" (binary) code for 1.0, as opposed to "PTX", since a 1.0 binary cannot run on a 3.0 device. Make sure you check the error values returned by all CUDA runtime calls.
  2. In some fairly rare cases, code compiled to PTX 1.0 will run faster after being JIT-compiled to 3.0 than code compiled directly to 3.0. This is caused by the different compilers used to emit GPU/PTX 1.0 code and SM 2+ code. Note that in the majority of cases the code emitted by the 2+ compiler is faster, but there have been reports of the opposite for some code patterns.
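
The error checking mentioned in point 1 could look roughly like the following. This is a minimal sketch, not the asker's code: the CUDA_CHECK macro and fillKernel are hypothetical names introduced for illustration. The key detail is that a kernel launch returns nothing itself, so launch failures have to be picked up with cudaGetLastError().

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file and line, then abort, when a
// CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel used only to have something to launch.
__global__ void fillKernel(float *data, float value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = value;
}

int main()
{
    const int n = 1024;
    float *d = NULL;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));

    fillKernel<<<(n + 255) / 256, 256>>>(d, 1.0f, n);
    CUDA_CHECK(cudaGetLastError());       // launch failures, e.g. no usable kernel image
    CUDA_CHECK(cudaDeviceSynchronize());  // failures raised while the kernel was running

    // Copy one value back to confirm the kernel really wrote something.
    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost));
    printf("first element: %f\n", first);

    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

If the kernel never ran, a suspiciously short runtime would be expected, and the check after the launch would report the reason.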

Update

Apparently your code needs a lot of registers, and compiling for 3.0 allocates more registers per thread (this architecture has more registers available), which limits occupancy.

You can try playing with your block size and/or capping the number of registers used by your code. It is hard to make any suggestions without seeing your code and experimenting with the profiler. I would also suggest trying CUDA toolkit 5.5 when it becomes available; the compiler may make different tradeoffs that improve the performance of your code.
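
Two common ways to cap register use are the nvcc option -maxrregcount=N, which applies to the whole file, and the __launch_bounds__ qualifier, which applies per kernel. The sketch below uses a hypothetical kernel, boundedKernel, and placeholder bounds (256, 4) that would need tuning for real code. It also reads back what the compiler actually produced with cudaFuncGetAttributes, so the effect of a change can be seen without parsing the ptxas output.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. __launch_bounds__(256, 4) promises that the kernel is
// launched with at most 256 threads per block and asks the compiler to keep
// register use low enough for 4 resident blocks per multiprocessor.
// Both numbers are placeholders to experiment with.
__global__ void __launch_bounds__(256, 4)
boundedKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, boundedKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // numRegs and localSizeBytes correspond to the register count and the
    // per-thread local memory (spills, stack) that ptxas -v reports.
    printf("registers per thread   : %d\n", attr.numRegs);
    printf("local memory per thread: %lu bytes\n",
           (unsigned long)attr.localSizeBytes);
    printf("max threads per block  : %d\n", attr.maxThreadsPerBlock);
    // binaryVersion shows which architecture the loaded code targets,
    // e.g. 30 for sm_30.
    printf("binary version         : %d\n", attr.binaryVersion);
    return 0;
}
```

Lowering the cap too far tends to increase spilling to local memory, so it is worth watching the local memory figure alongside the runtime while experimenting.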

I was dealing with the same question.

As it turns out, the CUDA compute capability version (1.0, 2.0, 2.1, 3.0, 3.5, etc.) indicates which types of operations the CUDA card can handle (see http://en.wikipedia.org/wiki/CUDA, section "Version features and specifications", the part with the red and green colored table).

Another matter is the computational power of each CUDA card, which depends on the number and type of cores, the RAM speed, etc.

So there could be a card that is "only" compute capability 3.0, such as the GTX 760 with 1152 cores, and another card, such as the GT 640 with 384 cores, that is compute capability 3.5.

The only code that can be compared on both devices has to be 3.0 compatible, and it would probably run much faster on the GTX 760, even though that card only has 3.0 and the GT 640 has 3.5.

I think they should make it a bit clearer that the compute capability version is not so much about speed, as most people think, but about capability.
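
To see both numbers side by side for the cards in a machine, the runtime API can be queried directly. This is a small sketch using cudaGetDeviceProperties; it prints the compute capability next to the multiprocessor count and memory size, which are the kind of figures that actually determine raw performance.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;  // skip devices that cannot be queried

        // major.minor is the compute capability (feature set); the
        // multiprocessor count and memory size describe the hardware size.
        printf("device %d: %s\n", dev, prop.name);
        printf("  compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors   : %d\n", prop.multiProcessorCount);
        printf("  global memory     : %lu MB\n",
               (unsigned long)(prop.totalGlobalMem >> 20));
    }
    return 0;
}
```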

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow