cuda - loop unrolling issue

https://stackoverflow.com/questions/15719515

30-03-2022
|

문제

I have a kernel with a #pragma unroll 80 and I'm running it with NVIDIA GT 285, compute capability 1.3, with grid architecture: dim3 thread_block( 16, 16 ) and dim3 grid( 40 , 30 ) and it works fine.

When I tried running it with NVIDIA GT 580, compute capability 2.0 and with the above grid architecture it works fine.

When I change the grid architecture on the GT 580 to

dim3 thread_block( 32 , 32 ) and dim3 grid( 20 , 15 ), thus producing the same number of threads as above, I get incorrect results.

If I remove #pragma unroll 80 or replace it with #pragma unroll 1 in GT 580 it works fine. If I don't then the kernel crashes.

Would anyone know why does this happen? Thank you in advance

EDIT: checked for kernel errors on both devices and I got the "invalid argument". As I searched for the causes of this error I found that this happens when the dimensions of the grid and the block exceed their limits. But this is not the case for me since I use 16x16=256 threads per block and 40x30=1200 total blocks. As far as I know these values are in the boundaries of the GPU grid for compute capability 1.3. I would like to know if this could have anything to do with the loop unrolling issue I have.

해결책

I figured out what the problem was.

After some bug fixes I got the "Too Many Resources Requested for Launch" error. For a loop unroll, extra registers per thread are needed and I was running out of registers, hence the error and the kernel fail. I needed 22 registers per thread, and I have 1024 threads per block.

By inserting my data into the CUDA_Occupancy_calculator it showed me that 1 block per SM is scheduled, leaving me with 32678 registers for a whole block on the compute capability 2.0 device.

22 registers*1024 threads = 22528 registers<32678 which should have worked. But I was compiling with nvcc -arch sm_13 using the C.C. 1.3 characteristic of 16384 registers per SM

I compiled with nvcc -arch sm_20 taking advantage of the 32678 registers, more than enough for the needed 22528, and it works fine now. Thanks to everyone, I learned about kernel errors.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow