Question

I was using CUDA-GDB to find out what the problem was with my kernel execution. It would always output: Cuda error: kernel execution: unspecified launch failure. That's probably the worst error anyone could possibly get, because there is no indication whatsoever of what is going on!

Back to CUDA-GDB... When I was using the debugger it would arrive at the kernel and output: Breakpoint 1, myKernel (__cuda_0=0x200300000, __cuda_1=0x200400000, __cuda_2=320, __cuda_3=7872, __cuda_4=0xe805c0, __cuda_5=0xea05e0, __cuda_6=0x96dfa0, __cuda_7=0x955680, __cuda_8=0.056646065580379823, __cuda_9=-0.0045986640087569072, __cuda_10=0.125, __cuda_11=18.598229033761132, __cuda_12=0.00048828125, __cuda_13=5.9604644775390625e-08) at myFunction.cu:60

Then I would type next. Output:

0x00007ffff7f7a790 in __device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd () from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3

The notable part in that output is the tag to a typedef'd data type. COMPLEX16 is defined as: typedef double complex COMPLEX16

Then I would type next again. Output:

Single stepping until exit from function Z84_device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_ddddddPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd@plt, which has no line number information. 0x00007ffff7f79560 in ?? () from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3

Type next... Output: Cannot find bounds of current function

Type continue... Output: Cuda error: kernel execution: unspecified launch failure.

Which is the same error I get without debugging. I have seen some forum topics on something similar, where the debugger cannot find the bounds of the current function, possibly because the library is somehow not linked, or something along those lines. The ?? was said to appear because the debugger is somewhere in a shell for some reason, not in any function.

I believe the problem lies deeper, in the fact that I have these interesting data types in my code: COMPLEX16 and REAL8.

Here is my kernel...

__global__ void chisquared_LogLikelihood_Kernel(REAL8 *d_temp, double *d_sum, int lower, int dataSize,
        COMPLEX16 *freqModelhPlus_Data,
        COMPLEX16 *freqModelhCross_Data,
        COMPLEX16 *freqData_Data,
        REAL8 *oneSidedNoisePowerSpectrum_Data,
        double FplusScaled,
        double FcrossScaled,
        double deltaF,
        double twopit,
        double deltaT,
        double TwoDeltaToverN)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ REAL8 ssum[MAX_THREADS];

    if (idx < dataSize)
    {
        idx += lower; //accounts for the shift that was made in the original loop

        memset(ssum, 0, MAX_THREADS * sizeof(*ssum));

        int tid = threadIdx.x;
        int bid = blockIdx.x;

        REAL8 plainTemplateReal = FplusScaled * freqModelhPlus_Data[idx].re
            + freqModelhCross_Data[idx].re;
        REAL8 plainTemplateImag = FplusScaled * freqModelhPlus_Data[idx].im
            + freqModelhCross_Data[idx].im;
        /* do time-shifting...             */
        /* (also un-do 1/deltaT scaling): */
        double f = ((double) idx) * deltaF;

        /* real & imag parts of  exp(-2*pi*i*f*deltaT): */
        double re = cos(twopit * f);
        double im = - sin(twopit * f);

        REAL8 templateReal = (plainTemplateReal*re - plainTemplateImag*im) / deltaT;
        REAL8 templateImag = (plainTemplateReal*im + plainTemplateImag*re) / deltaT;
        double dataReal     = freqData_Data[idx].re / deltaT;
        double dataImag     = freqData_Data[idx].im / deltaT;
        /* compute squared difference & 'chi-squared': */
        double diffRe       = dataReal - templateReal;         // Difference in real parts...
        double diffIm       = dataImag - templateImag;         // ...and imaginary parts, and...
        double diffSquared  = diffRe*diffRe + diffIm*diffIm ;  // ...squared difference of the 2 complex figures.


        //d_temp[idx - lower] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);

        //ssum[tid] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);

        /*****   REDUCTION    *****/

        //__syncthreads(); //all the temps should have data before we add them up

        //for (int i = blockDim.x / 2; i > 0; i >>= 1) { /* per block */
        //  if (tid < i)
        //     ssum[tid] += ssum[tid + i];

        //  __syncthreads();
        //}

        //d_sum[bid] = ssum[0];

    }
}

When I'm not debugging (-g -G not included in the compile command), the kernel only runs fine if I don't include the line(s) that begin with d_temp[idx - lower] and ssum[tid]. I only tried d_temp to make sure it wasn't a shared-memory error; it ran fine. I also tried running with ssum[tid] = 20.0 and various other values to rule out that sort of problem; that ran fine too. When I run with either of those lines included, the kernel exits with the CUDA error above.

Please ask me if something is unclear or confusing.


Solution

There was a lack of context in my question. The assumption was probably that I had done cudaMalloc and other such preliminary setup before the kernel execution for ALL the pointers involved. However, I had only done it for d_temp and d_sum (I was making tons of changes and barely realized I had never allocated the other four pointers). Once I did cudaMalloc and cudaMemcpy for the data needed, everything ran perfectly.
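In other words, every device pointer passed to the kernel needs its own cudaMalloc (and cudaMemcpy for inputs) before the launch. A sketch of the missing setup — the host-array names, sizes, and launch configuration here are illustrative, not taken from my actual host code:

```cuda
// Hypothetical host arrays: h_freqModelhPlus, h_freqModelhCross,
// h_freqData, h_noisePSD, each of length n = dataSize.
COMPLEX16 *d_freqModelhPlus, *d_freqModelhCross, *d_freqData;
REAL8 *d_noisePSD;
size_t n = dataSize;

cudaMalloc(&d_freqModelhPlus,  n * sizeof(COMPLEX16));
cudaMalloc(&d_freqModelhCross, n * sizeof(COMPLEX16));
cudaMalloc(&d_freqData,        n * sizeof(COMPLEX16));
cudaMalloc(&d_noisePSD,        n * sizeof(REAL8));

cudaMemcpy(d_freqModelhPlus,  h_freqModelhPlus,  n * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_freqModelhCross, h_freqModelhCross, n * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_freqData,        h_freqData,        n * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_noisePSD,        h_noisePSD,        n * sizeof(REAL8),     cudaMemcpyHostToDevice);

// Launch, then check for errors right away instead of waiting for a later
// API call to report "unspecified launch failure".
chisquared_LogLikelihood_Kernel<<<blocks, threads>>>(d_temp, d_sum, lower, dataSize,
    d_freqModelhPlus, d_freqModelhCross, d_freqData, d_noisePSD,
    FplusScaled, FcrossScaled, deltaF, twopit, deltaT, TwoDeltaToverN);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
```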

Thanks for the insight.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow