PyCUDA strange error cuLaunchKernel failed: invalid value

Answer

The basic problem here is the size of test inside your kernel. As you have written it, every thread requires 1228800 bytes of local memory. The runtime must reserve that memory for every thread, so your code would need about 750 MB of free device memory just for local memory to support the 640 threads per block you are trying to launch. My guess is that your device doesn't have that amount of free memory.
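For what it's worth, the arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: the two constants are the figures quoted in this answer, the function name is made up for illustration, and it counts only the 640 threads of a single block.

```python
# Per-thread stack frame reported by ptxas for the kernel, in bytes.
STACK_FRAME_BYTES = 1228800
# Threads per block in the attempted launch.
THREADS_PER_BLOCK = 640


def local_memory_bytes(per_thread_bytes, threads):
    """Local memory needed to back `threads` threads (illustrative helper)."""
    return per_thread_bytes * threads


total = local_memory_bytes(STACK_FRAME_BYTES, THREADS_PER_BLOCK)
total_mib = total / (1024 * 1024)
print(f"{total} bytes = {total_mib:.0f} MiB")
```

Comparing that total against the free memory your device reports should tell you quickly whether the launch has any chance of succeeding.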

The reason the code you have shown works without the if statement is compiler optimisation: in that case test isn't actually used for anything, so the compiler simply removes it from the code, which eliminates the huge local memory footprint of the kernel and allows it to run. When you uncomment the if statement, test determines whether a global memory write happens, so the compiler cannot optimise it away and the kernel requires a large amount of local memory to run.

This is the compiler output I see with the kernel code as you posted it:

> nvcc -arch=sm_21 -Xptxas="-v" -m32 -c wnkr_py.cu
wnkr_py.cu
wnkr_py.cu(7): warning: variable "test" was set but never used

tmpxft_00000394_00000000-5_wnkr_py.cudafe1.gpu
tmpxft_00000394_00000000-10_wnkr_py.cudafe2.gpu
wnkr_py.cu
wnkr_py.cu(7): warning: variable "test" was set but never used

ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z6totaalPi' for 'sm_21'
ptxas : info : Function properties for _Z6totaalPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 8 registers, 36 bytes cmem[0], 4 bytes cmem[16]
tmpxft_00000394_00000000-5_wnkr_py.cudafe1.cpp
tmpxft_00000394_00000000-15_wnkr_py.ii

Note the compiler warning and the stack frame size.

With the if statement active:

> nvcc -arch=sm_21 -Xptxas="-v" -m32 -c wnkr_py.cu
wnkr_py.cu
tmpxft_000017c8_00000000-5_wnkr_py.cudafe1.gpu
tmpxft_000017c8_00000000-10_wnkr_py.cudafe2.gpu
wnkr_py.cu
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z6totaalPi' for 'sm_21'
ptxas : info : Function properties for _Z6totaalPi
    1228800 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 36 bytes cmem[0]
tmpxft_000017c8_00000000-5_wnkr_py.cudafe1.cpp
tmpxft_000017c8_00000000-15_wnkr_py.ii

Note the stack frame size changes to 1228800 bytes per thread.
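If you want to catch this sort of thing automatically, one option is to scan the ptxas -v output for large stack frames. The following is a minimal sketch, assuming the output has already been captured as a string; the function names and the 64 KiB threshold in the usage note are hypothetical choices, not anything PyCUDA or nvcc provides.

```python
import re

# The "Function properties" detail line printed by ptxas -v for the kernel
# with the if statement active, copied from the output above.
PTXAS_LINE = "1228800 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads"

# Matches the per-thread stack frame size in ptxas -v output.
STACK_RE = re.compile(r"(\d+) bytes stack frame")


def stack_frame_bytes(ptxas_output):
    """Return every stack frame size (in bytes) found in the given text."""
    return [int(size) for size in STACK_RE.findall(ptxas_output)]


def oversized_frames(ptxas_output, limit_bytes):
    """Return the stack frame sizes that exceed limit_bytes (a caller-chosen threshold)."""
    return [size for size in stack_frame_bytes(ptxas_output) if size > limit_bytes]
```

For example, oversized_frames(PTXAS_LINE, 64 * 1024) flags the 1228800 byte frame, while the same call on the output of the version without the if statement (0 bytes stack frame) returns an empty list.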

My quick reading of the code suggests that test doesn't need to be anything like as large as you have defined it for the code to run, but I leave the required size as an exercise for the reader.