A rather unusual error.
The failure is occurring because:
- By specifying a virtual architecture only (
-arch compute_11
) you defer the PTX compile step until runtime (i.e. you are forcing JIT-compile) - The JIT-compile is failing (at runtime)
- The failure of the JIT-compile (and link) means device symbols cannot be properly established
- Due to the problem with device symbols, the operation
cudaGetSymbolAddress
on the device symboldev_base
fails, and throws an error.
Why is the JIT-compile failing? You can find out yourself by triggering the machine code compile (which runs the ptxas
assembler) by specifying -arch=sm_11
instead of -arch compute_11
. If you do that, you'll get this result:
ptxas error : Entry function '_Z4findPjS_' uses too much local data (0x10100 bytes, 0x4000 max)
So even though your code doesn't call the find
kernel, it must compile successfully to have a sane device environment for symbols.
Why does this compile error occur? Because you are requesting too much local memory per thread. cc 1.x devices are limited to 16KB local memory per thread, and your find
kernel is requesting quite a bit more than that (over 64KB).
When I initially tried it on my device, I was using a cc2.0 device which has a higher limit (512KB per thread) and so the JIT-compile step succeeded.
In general, I would recommend specifying both a virtual architecture and a machine architecture, and the shorthand way to do that is:
nvcc -arch=sm_11 ....
(for a cc1.1 device)
This question/answer may also be of interest, and the nvcc manual has more details about virtual vs. machine architecture, and how to specify the compilation phases for each.
I believe the reason the error goes away when you comment out those particular lines in the kernel, is that with those commented out, the compiler is able to optimize-out the accesses to those local memory areas, and optimize-out the instantiation of the local memory. This allows the JIT-compile step to complete successfully, and your code runs "without runtime error".
You can verify this by commenting those lines out and then specify a full compile (nvcc -arch=sm_11 ...
), where -arch
is short for --gpu-architecture
.