
I get the following error when launching a kernel:

too many resources requested for launch

I searched online for hints about this error message. They suggest that it happens when the kernel uses more registers than the GPU allows per multiprocessor. The device query results are as follows:

Device 0: "GeForce GTX 470"
CUDA Driver Version / Runtime Version          5.0 / 5.0
CUDA Capability Major/Minor version number:    2.0
Total amount of global memory:                 1279 MBytes (1341325312 bytes)
(14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
GPU Clock rate:                                1215 MHz (1.22 GHz)
Memory Clock rate:                             1674 Mhz
Memory Bus Width:                              320-bit
L2 Cache Size:                                 655360 bytes
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 32768
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Maximum sizes of each dimension of a block:    1024 x 1024 x 64
Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535

Update: Robert Crovella remarked that he had no problems running the code, so I am pasting the complete code snippet here.

The complete code looks like this:

__global__ void calc_params(double *d_result_array, int total_threads) {

        // Global 1D index; the guard keeps the last, partially filled block in bounds
        int thread_id = threadIdx.x + (blockDim.x * blockIdx.x);
        if (thread_id < total_threads)
                d_result_array[thread_id] = 1 / d_result_array[thread_id];
}

void calculate() {

     double *h_array;
     double *d_array;

     size_t array_size = pow((double)31, 2) * 2 * 10;

     h_array = (double *)malloc(array_size * sizeof(double));
     cudaMalloc((void **)&d_array, array_size * sizeof(double));

     for (size_t i = 0; i < array_size; i++) {
        h_array[i] = i;
     }

     cudaMemcpy(d_array, h_array, array_size * sizeof(double), cudaMemcpyHostToDevice);

     int BLOCK_SIZE = 1024;
     // Round up; the inner parentheses matter because ?: binds more loosely than +
     int NUM_OF_BLOCKS = (array_size / BLOCK_SIZE) + ((array_size % BLOCK_SIZE) ? 1 : 0);

     calc_params<<<NUM_OF_BLOCKS, BLOCK_SIZE>>>(d_array, array_size);
     // Launch errors are only reported when queried explicitly
     checkCudaErrors(cudaGetLastError());
}


When I execute this code, I get the error: too many resources requested for launch

However, if I replace the inverse statement in the kernel
(i.e. d_result_array[thread_id] = 1 / d_result_array[thread_id])
with a multiplication
(i.e. d_result_array[thread_id] = d_result_array[thread_id] * 200),
the kernel runs without any error.

Why? Is there an alternative other than using a smaller block size? If that is the only solution, how can I determine a block size that will work?


P.S. For those who want to know what checkCudaErrors is:

#define checkCudaErrors(val) check( (val), #val, __FILE__, __LINE__)

template<typename T>
void check(T err, const char* const func, const char* const file, const int line) {
  if (err != cudaSuccess) {
    std::cerr << "CUDA error at: " << file << ":" << line << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
  }
}

Build and OS Information

Build of configuration Debug for project TEST

make all 
Building file: ../
Invoking: NVCC Compiler
nvcc -G -g -O0 -gencode arch=compute_20,code=sm_20 -odir "" -M -o "test_param.d" "../"
nvcc --compile -G -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20  -x cu -o  "test_param.o" "../"
Finished building: ../

Building target: TEST
Invoking: NVCC Linker
nvcc  -link -o  "TEST"  ./test_param.o   
Finished building target: TEST

Operating System

Ubuntu Lucid (10.04.4) 64bit
Linux paris 2.6.32-46-generic #105-Ubuntu SMP Fri Mar 1 00:04:17 UTC 2013 x86_64 GNU/Linux

Error I receive

CUDA error at: ../
too many resources requested for launch cudaGetLastError()

This seems to be an artifact of the compiler. The problem is register usage, which you can observe by passing the -Xptxas -v option on the nvcc command line. The -G version of the code uses quite a bit more registers per thread than the regular code, most likely because -G turns off device code optimizations. You have a few options:

  1. Don't use the -G switch. This switch should only be used for debug purposes anyway, as it generates code that may run slower than without the -G switch.
  2. If you want to use the -G switch, then reduce the number of threads per block. For the example in this case, I was able to get it to run with 768 threads per block or fewer.
  3. Instruct the compiler to use fewer registers per thread. You can do this with the -maxrregcount switch, such as:

    nvcc -Xptxas -v -arch=sm_20 -G -maxrregcount=20 -o t145

The objective in this last case is to keep (registers per thread * threads per block) no greater than the maximum number of registers per block for the GPU in use. A typical CC 2.0 GPU has a maximum of 32768 registers available per block (which you can discover with the deviceQuery sample), so a launch with 1024 threads per block can use at most 32768 / 1024 = 32 registers per thread.


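As for how to know which block size will work: instead of reading the compiler output by hand, you can ask the runtime. The sketch below assumes a kernel like the one in the question (rewritten here with a 1D global index) and uses cudaFuncGetAttributes, which reports both the registers per thread and the maximum block size for the kernel as it was actually compiled. The variable names and the fallback logic are illustrative choices, not the only way to do this.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise inverse, as in the question, indexed for a 1D grid
__global__ void calc_params(double *d_result_array, int total_threads) {
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id < total_threads)
        d_result_array[thread_id] = 1 / d_result_array[thread_id];
}

int main() {
    const int n = 31 * 31 * 2 * 10;  // array size used in the question
    double *d_array = 0;
    cudaMalloc((void **)&d_array, n * sizeof(double));
    cudaMemset(d_array, 0, n * sizeof(double));

    // Query how this particular build of the kernel was compiled
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, calc_params);
    std::printf("registers per thread: %d, max threads per block: %d\n",
                attr.numRegs, attr.maxThreadsPerBlock);

    // Never request more threads than this kernel can be launched with
    int block_size = std::min(1024, attr.maxThreadsPerBlock);
    int num_blocks = (n + block_size - 1) / block_size;
    calc_params<<<num_blocks, block_size>>>(d_array, n);

    // Report launch errors first, then errors raised during execution
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    std::printf("launch with %d threads per block: %s\n",
                block_size, cudaGetErrorString(err));

    cudaFree(d_array);
    return err == cudaSuccess ? 0 : 1;
}
```

Building this once with -G and once without should show the difference in numRegs directly. A per-kernel alternative to -maxrregcount is the __launch_bounds__ qualifier, which tells the compiler the largest block size the kernel will be launched with so that it can limit register usage accordingly.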