CUDA ptxas Error "function uses too much shared data"

https://stackoverflow.com/questions/23648525

22-07-2023

Question

I have never used CUDA or C++ before, but I am trying to get RAMSES-GPU (from http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/download.html) running.
Because of an error in autogen.sh, I ran ./configure directly, and that worked.
The generated Makefile contains the following NVCC flags:

 NVCCFLAGS = -gencode=arch=compute_10,code=sm_10  -gencode=arch=compute_11,code=sm_11 -gencode=arch=compute_13,code=sm_13  -gencode=arch=compute_20,code=sm_20  -gencode=arch=compute_20,code=compute_20 -use_fast_math -O3

But when I try to compile the program using make, I get multiple ptxas Errors:

Entry function '_Z30kernel_viscosity_forces_3d_oldPfS_S_S_iiiiiffff' uses too much shared data (0x70d0 bytes + 0x10 bytes system, 0x4000 max)
Entry function '_Z26kernel_viscosity_forces_3dPfS_S_S_iiiiiffff' uses too much shared data (0x70d0 bytes + 0x10 bytes system, 0x4000 max)
Entry function '_Z32kernel_viscosity_forces_3d_zslabPfS_S_S_iiiiiffff9ZslabInfo' uses too much shared data (0x70e0 bytes + 0x10 bytes system, 0x4000 max)

I'm trying to compile this code on Linux (kernel 2.6) with CUDA 4.2 on two NVIDIA Tesla C1060 cards. (This is at my university, where things are not upgraded regularly.) I tried replacing sm_10, sm_11 and sm_13 with sm_20 (I saw this fix here: Entry function uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max) - CUDA error), but that didn't fix my problem.
Do you have any suggestions? I can upload the Makefile as well as everything else, if you need it.
Thank you for your help!

Solution

The code you are compiling requires a static allocation of 28880 bytes (0x70d0) of shared memory per block. For GPUs of compute capability 2.x and newer this is no problem, because they support up to 48 KB of shared memory per block. For compute capability 1.x devices, however, the shared memory limit is 16 KB (0x4000 bytes, the maximum quoted in the error), and up to 256 bytes of that can be consumed by kernel arguments. Because of this, the code cannot be compiled for compute 1.x devices, and the compiler is emitting an error to tell you so. The error therefore comes from passing compute 1.x targets (sm_10, sm_11 and sm_13) to the compiler. If you remove them, the build should work.
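To make the arithmetic behind the error message concrete, here is a minimal host-side C++ sketch of the comparison the message describes. It is not how ptxas works internally. The function names (`shared_limit_bytes`, `fits_in_shared`, `viable_targets`) are hypothetical and exist only for illustration. The limits are the 16 KB and 48 KB figures quoted above, and the 0x10 system bytes are taken from the compute 1.x error output (the overhead reported for other targets may differ).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Per-block static shared memory limit, using the figures quoted in the
// answer: 16 KB for compute capability 1.x, 48 KB for 2.x and newer.
// Hypothetical helper, for illustration only.
std::size_t shared_limit_bytes(int cc_major) {
    return cc_major < 2 ? 16 * 1024 : 48 * 1024;
}

// True when the kernel's shared data plus the system-reserved bytes fit
// within the limit, i.e. the comparison the error message reports.
bool fits_in_shared(std::size_t kernel_bytes, std::size_t system_bytes,
                    int cc_major) {
    return kernel_bytes + system_bytes <= shared_limit_bytes(cc_major);
}

// Filters a list of "sm_XY" target names down to those whose limit is
// large enough. Assumes every name has the form "sm_" followed by digits.
std::vector<std::string> viable_targets(std::size_t kernel_bytes,
                                        std::size_t system_bytes,
                                        const std::vector<std::string>& targets) {
    std::vector<std::string> ok;
    for (const std::string& t : targets) {
        int cc_major = std::stoi(t.substr(3)) / 10;  // "sm_13" -> 13 -> 1
        if (fits_in_shared(kernel_bytes, system_bytes, cc_major)) {
            ok.push_back(t);
        }
    }
    return ok;
}
```

With the numbers from the error, `fits_in_shared(0x70d0, 0x10, 1)` is false (28896 bytes against a 16384-byte limit), while `fits_in_shared(0x70d0, 0x10, 2)` is true. Calling `viable_targets` on the four sm_ targets from the question's NVCCFLAGS leaves only sm_20.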

However, it gets worse. The Tesla C1060 is a compute capability 1.3 device, so you will not be able to compile and run those kernels on your GPUs. There is no solution short of omitting those kernels from the build (if you don't need them), back-porting the code to the compute 1.x architecture (I have no idea whether that is feasible), or finding more modern hardware to run the code on.

Licensed under: CC-BY-SA with attribution
