This behavior is typical of a kernel launch failure. Make sure you check the return codes of all CUDA calls. For debugging, you may also want to add a call to cudaDeviceSynchronize()
immediately after the kernel launch and check its return code; this is the most precise way to find the cause of an asynchronous kernel failure.
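As a rough sketch of the checking described above (the `CUDA_CHECK` macro and the `scale` kernel are hypothetical names made up for illustration, and the `cudaGetLastError()` call is an addition that catches launch-time errors separately from errors raised while the kernel runs):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the failing call and abort if a CUDA
// runtime call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,      \
                         __LINE__, #call, cudaGetErrorString(err_));      \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch (e.g. bad configuration)
    CUDA_CHECK(cudaDeviceSynchronize());  // debugging aid: errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_data));
    std::printf("kernel completed without errors\n");
    return 0;
}
```

The extra cudaDeviceSynchronize() serializes host and device, so you would normally remove it (or compile it out) once the problem is found.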
Update: When code runs outside the debugger but not under cuda-gdb, the most common cause is trying to debug on a single-GPU system from within a graphical environment. cuda-gdb cannot share a GPU with the X Window System, as this would hang the OS.
If your system has only one GPU, you need to exit the graphical environment (e.g. shut down the X server) and debug from the console.
If you have a multi-GPU system, check your X configuration (xorg.conf) and make sure X does not use the GPU you reserve for debugging.
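To help decide which GPU to keep away from X, a small program along these lines can list each device's PCI bus ID and whether the kernel run-time limit (watchdog) is active. This is only a heuristic sketch: the watchdog is typically enabled on a GPU that drives a display, but that is an assumption about your setup and not a guarantee, so compare the bus IDs against your xorg.conf.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        char busId[32] = {0};
        int watchdog = 0;

        // PCI bus ID string of the form domain:bus:device.function.
        err = cudaDeviceGetPCIBusId(busId, static_cast<int>(sizeof(busId)), dev);
        if (err == cudaSuccess) {
            // Non-zero when kernels on this device are subject to a run-time limit.
            err = cudaDeviceGetAttribute(&watchdog, cudaDevAttrKernelExecTimeout, dev);
        }
        if (err != cudaSuccess) {
            std::fprintf(stderr, "device %d: query failed: %s\n",
                         dev, cudaGetErrorString(err));
            continue;
        }

        std::printf("device %d  PCI %s  kernel run-time limit: %s\n",
                    dev, busId, watchdog ? "yes (likely used for display)" : "no");
    }
    return 0;
}
```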