CUDA compute capability 1.0 faster than 3.5

Question 1

About two years ago I switched my simulation from CUDA 3.2 to CUDA 4.0 and saw a performance hit of about 10%. With compute capability 2.0, NVIDIA introduced IEEE 754-2008 compliant floating-point arithmetic (CC 1.0 followed IEEE 754-1985). This, together with the removal of the default "flush to zero" behavior for denormal numbers, was the reason for the performance hit. Try compiling your CC 3.0 executable with the compiler flag --use_fast_math, which restores roughly the old, lower precision of CC 1.0.
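To make "flush to zero" concrete, here is a minimal host-side sketch in Python. It only emulates the effect on single-precision values and does not run anything on the GPU. The helper names to_float32 and flush_to_zero are made up for this illustration.

```python
import math
import struct

# Smallest positive normal float32 value, 2**-126 (about 1.18e-38).
FLT_MIN = 2.0 ** -126

def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def flush_to_zero(x):
    """Emulate flush-to-zero: denormal float32 values become a signed zero."""
    y = to_float32(x)
    if 0.0 < abs(y) < FLT_MIN:
        return math.copysign(0.0, y)
    return y

tiny = 1e-40                   # representable in float32 only as a denormal
print(to_float32(tiny))        # IEEE 754 behavior keeps a nonzero denormal
print(flush_to_zero(tiny))     # flush-to-zero behavior discards it
```

Handling denormals properly costs extra work, which is why keeping them can be slower than flushing them.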

Question 2

Note that the GTX 680 cannot run SM 3.5 code, only 3.0. Only the Titan can run SM 3.5.

I have two quite different explanations for the difference you are seeing:

No GPU code is actually executed. This can happen if you compile for "GPU" 1.0 (as opposed to "PTX" 1.0): a 1.0 binary cannot run on a 3.0 device, and without embedded PTX there is nothing for the driver to JIT-compile, so the kernel launches fail. Make sure you check the error values returned by all CUDA runtime calls.
In some fairly rare cases, code compiled to PTX 1.0 will run faster after being JIT-compiled to 3.0 than code compiled directly to 3.0. This is caused by the different compilers used to emit GPU/PTX 1.0 code and SM 2+ code. Note that in the majority of cases the code emitted by the 2+ compiler is faster, but there have been reports of the opposite for some code patterns.

Update

Apparently, your code needs a lot of registers, and compiling for 3.0 allocates more registers per thread (since this architecture has more registers available), which limits the occupancy.

You can try playing with your block size and/or capping the number of registers used by your code (for example with nvcc's --maxrregcount option or the __launch_bounds__ qualifier). It is hard to make any concrete suggestions without seeing your code and experimenting with the profiler. I would also suggest trying CUDA toolkit 5.5 when it becomes available: the compiler may make different tradeoffs that improve the performance of your code.
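The link between register usage and occupancy can be sketched with some simple arithmetic. The following is a rough estimate only. It assumes the per-multiprocessor limits NVIDIA publishes for compute capability 3.0 (65536 registers, 2048 resident threads, 16 resident blocks) and ignores register allocation granularity and shared memory limits. The function name estimated_occupancy is made up for this illustration; for real numbers, use the profiler or NVIDIA's occupancy calculator.

```python
# Per-multiprocessor limits for compute capability 3.0 (assumed here,
# taken from NVIDIA's published compute capability tables).
REGS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16

def estimated_occupancy(regs_per_thread, threads_per_block):
    """Rough occupancy estimate limited by registers, threads and blocks."""
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = REGS_PER_SM // regs_per_block
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks = min(blocks_by_regs, blocks_by_threads, MAX_BLOCKS_PER_SM)
    return blocks * threads_per_block / MAX_THREADS_PER_SM

# Compare a register-hungry kernel with a capped one at 256 threads per block.
for regs in (63, 32):
    print(regs, estimated_occupancy(regs, 256))
```

Under these assumptions, 63 registers per thread at 256 threads per block gives an estimated occupancy of 0.5, while capping the kernel at 32 registers per thread gives 1.0. Whether the higher occupancy actually helps depends on how much register spilling the cap causes.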

Question 3

I was dealing with the same question.

As it turns out, the CUDA compute capability number (1.0, 2.0, 2.1, 3.0, 3.5, etc.) indicates which types of operations the CUDA card can handle (see http://en.wikipedia.org/wiki/CUDA, section "Version features and specifications", the table colored red and green).

The computing power of each CUDA card is a separate matter. It depends on the number and type of GPU cores, the memory speed, and so on.

So one card, such as the GTX 760, may have 1152 cores but "only" compute capability 3.0, while another, such as the GT 640, has just 384 cores but compute capability 3.5.

Any code that is to be compared on both devices has to be 3.0 compatible, and it would probably run much faster on the GTX 760, even though that card only has 3.0 and the GT 640 has 3.5.
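A back-of-the-envelope comparison shows why. Peak single-precision throughput scales roughly with core count times clock rate. In the sketch below the core counts are the ones quoted above, but the 1.0 GHz clock is a placeholder assumption used for both cards, not a real specification, and peak_gflops is a made-up helper name.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical peak, counting one fused multiply-add as 2 FLOPs."""
    return cores * clock_ghz * flops_per_core_per_cycle

ASSUMED_CLOCK_GHZ = 1.0  # hypothetical clock, identical for both cards

gtx_760 = peak_gflops(1152, ASSUMED_CLOCK_GHZ)  # compute capability 3.0
gt_640 = peak_gflops(384, ASSUMED_CLOCK_GHZ)    # compute capability 3.5
ratio = gtx_760 / gt_640

print(gtx_760, gt_640, ratio)
```

At equal clocks, the card with the lower compute capability has three times the theoretical peak, simply because it has three times as many cores.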

I think they should make it clearer that the compute capability number is not so much about speed, as most people think, but about capability.