Question

I am trying to call two kernels in a loop, as shown below:

for (t=0; t<=time_total; t++)
{    
      //kernel calls
      kernel1<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
      checkCudaError(cudaThreadSynchronize());

      kernel2<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
      checkCudaError(cudaThreadSynchronize());

}

The structure of the second kernel is:

var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**

When I execute this code, checkCudaError does not report anything, and the program runs and produces some output, but Visual Studio reports the following exception:

First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..

When I check with Nsight, it reports the following error for kernel 2:

CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

The problem is that the var array written by kernel 2 has some rows that are correct, some that are copies of other rows, and some that are garbage.

Also, when I do this:

var[index+0]=3
var[index+1]=3
var[index+2]=3

all the values of var are set to 3.


Solution

A few side notes:

  1. cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
  2. The fact that Nsight reports an error on the second kernel launch while your error checking code does not leads me to believe that your error checking code is broken.
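One possible reason for the second note, assuming checkCudaError only inspects the return value of the synchronize call: a launch that is rejected because of its configuration or resource request is reported through cudaGetLastError() immediately after the launch, and a later synchronize can still return success. The sketch below checks both places. The names checkCuda, stubKernel and launchAndCheck are hypothetical stand-ins, not the code from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a message and returns false on any CUDA error.
bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel standing in for kernel2: each thread writes three values.
__global__ void stubKernel(float *var, int n)
{
    int index = 3 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (index + 2 < n) {
        var[index + 0] = 3.0f;
        var[index + 1] = 3.0f;
        var[index + 2] = 3.0f;
    }
}

// Launches the kernel and checks both the launch and the execution.
bool launchAndCheck(float *d_var, int n, int noOfBlocks, int noOfThreadsPerBlock)
{
    stubKernel<<<noOfBlocks, noOfThreadsPerBlock>>>(d_var, n);

    // Launch failures (bad configuration, too many resources requested)
    // are reported here, right after the launch.
    if (!checkCuda(cudaGetLastError(), "kernel launch")) return false;

    // Errors that occur while the kernel runs are reported here.
    return checkCuda(cudaDeviceSynchronize(), "kernel execution");
}
```

Calling launchAndCheck once per kernel inside the time loop would surface a launch failure on the first iteration instead of leaving stale data in var.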

Now, regarding your issue: an out-of-resources launch failure is frequently caused by the kernel requesting too many registers, that is, the registers used per thread multiplied by the number of threads per threadblock exceeds what the device can give to one block. Try recompiling your code with -Xptxas -v to get verbose output, which includes the register count per thread, and then recompile again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
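The register count that -Xptxas -v prints at compile time can also be read at run time with cudaFuncGetAttributes() and compared against the device limits. This is a rough sketch under assumptions: kernel2Stub is a hypothetical placeholder for the real kernel, device 0 is assumed, and the per-block estimate ignores the hardware's register allocation granularity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compile-time alternatives (app.cu is a hypothetical file name):
//   nvcc -Xptxas -v app.cu        prints registers used per thread
//   nvcc -maxrregcount 20 app.cu  caps registers per thread at 20

// Placeholder standing in for the real kernel2.
__global__ void kernel2Stub(float *var)
{
    var[blockIdx.x * blockDim.x + threadIdx.x] = 3.0f;
}

// Rough estimate of registers needed by one block (ignores allocation granularity).
long registersPerBlock(int regsPerThread, int threadsPerBlock)
{
    return static_cast<long>(regsPerThread) * threadsPerBlock;
}

// Returns true if a launch with this block size is expected to fit on device 0.
bool fitsOnDevice(int threadsPerBlock)
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, kernel2Stub) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    long needed = registersPerBlock(attr.numRegs, threadsPerBlock);
    printf("regs/thread=%d, regs/block needed=%ld, available=%d, max threads/block=%d\n",
           attr.numRegs, needed, prop.regsPerBlock, attr.maxThreadsPerBlock);

    // attr.maxThreadsPerBlock already accounts for this kernel's resource usage.
    return threadsPerBlock <= attr.maxThreadsPerBlock && needed <= prop.regsPerBlock;
}
```

If fitsOnDevice(noOfThreadsPerBlock) returns false for the block size used in the loop, either a smaller block or a lower register cap would be the thing to test first.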

If this "fixes" your problem, you may then want to consider the following:

  1. See if there is a way you can re-order or restructure your code to reduce the register pressure.
  2. If not, adjust your maxrregcount value upwards to approximately the highest value that still allows your code to compile and run with the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different values of this setting, as it can affect occupancy. Usually, if you set it to the highest value that will compile and run, you are limiting yourself to one threadblock per SM at execution time. This may be OK, or a lower setting may be better, allowing two threadblocks to be resident per SM and possibly giving higher performance. Only benchmarking your code will tell.
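Before benchmarking, the number of threadblocks that can be resident per SM for a given build can be estimated with the runtime occupancy API (available in CUDA 6.5 and later). The sketch below is illustrative only: kernel2Stub is a hypothetical placeholder for the real kernel, no dynamic shared memory is assumed, and device 0 is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real kernel2.
__global__ void kernel2Stub(float *var)
{
    var[blockIdx.x * blockDim.x + threadIdx.x] = 3.0f;
}

// Fraction of an SM's thread slots that are occupied by resident blocks.
float occupancyFraction(int blocksPerSM, int threadsPerBlock, int maxThreadsPerSM)
{
    return static_cast<float>(blocksPerSM * threadsPerBlock) / maxThreadsPerSM;
}

// Returns the number of resident blocks per SM for this block size, or -1 on error.
int residentBlocksPerSM(int threadsPerBlock)
{
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, kernel2Stub, threadsPerBlock, 0 /* dynamic shared memory bytes */);
    return err == cudaSuccess ? numBlocks : -1;
}

// Prints the estimate for a few candidate block sizes on device 0.
void reportOccupancy()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return;

    const int sizes[] = {128, 256, 512, 1024};
    for (int threads : sizes) {
        int blocks = residentBlocksPerSM(threads);
        if (blocks < 0) continue;  // query failed for this block size
        printf("%4d threads/block: %d block(s)/SM, occupancy %.2f\n",
               threads, blocks,
               occupancyFraction(blocks, threads, prop.maxThreadsPerMultiProcessor));
    }
}
```

Rebuilding with different -maxrregcount values and rerunning reportOccupancy shows where the block count per SM changes from one to two, which narrows down the settings worth benchmarking.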
Licensed under: CC-BY-SA with attribution