cuda-gdb Kernel breakpoint not working

https://stackoverflow.com/questions/14934323

10-03-2022
|

Question

For some reason, the breakpoints I set in a specific kernel are completely ignored... I have checked the error status with cudaGetLastError(), which told me that everything ran fine so I am quite sure this should mean that the kernel has executed. Placing printf statements also yields no extra information, as nothing is printed. Even in a kernel that is entered in debug mode, the printf calls have no effect. What could go wrong here?!

We are running Cuda 4.2 on a Tesla M2075 (driver version 295.41). Output when debugging:

(cuda-gdb) break cudaCalcBeamIntersect
Breakpoint 1 at 0x401cfb: file cudacalcbeamintersect.cu, line 109.
(cuda-gdb) r
Starting program: /home/heit/cuda/vfind/vfind singleevent.txt 1 1 1 
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5dd5700 (LWP 20241)]
[Context Create of context 0x634220 on Device 0]
[Launch of CUDA Kernel 0 (memset32_post<<<(64,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (memset32_post<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (memset32_post<<<(64,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 3 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 4 (memset32_post<<<(1,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 5 (memset32_post<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 6 (cudaInitializeGlobals<<<(256,1,1),(128,1,1)>>>) on Device 0]
no error
[Launch of CUDA Kernel 7 (cudaCalcBeamIntersect<<<(256,1,1),(128,1,1)>>>) on Device 0]
no error
Elapsed time: 0.876842 seconds.
[Thread 0x7ffff5dd5700 (LWP 20241) exited]
[Termination of CUDA Kernel 6 (cudaInitializeGlobals<<<(256,1,1),(128,1,1)>>>) on Device 0]

Program exited normally.

The "no error" prints are printed outside the kernels by calling cout << cudaGetErrorString(cudaGetLastError()) << '\n';, and indicate that both cudaInitializeGlobals() (which can be stepped through in cuda-gdb) and cudaCalcBeamIntersect() are executed without problems. The latter however, cannot be debugged.

The kernel in question is still a preliminary one, and calculates some values to be stored in (static) global memory. Nothing else is done with these values, so could it be that the compiler optimizes this call away completely? If so, why??!! And how to prevent this behavior?? (-O0 has no effect)

Cheers!

Edit - The code:

** Code calling the kernels **

    uint const nEvents = events.size();     // total number of events

    /* Not important ... */

// Allocate memory to hold the events
    Track *dev_events;                      
    cudaMalloc(&dev_events, linearEvents.size() * sizeof(Track));

// Copy all events to the GPU
    cudaMemcpy(dev_events, &linearEvents[0], linearEvents.size() * sizeof(Track), cudaMemcpyHostToDevice);

// Initialize the global data, like the histogram and the array of z-values
    cudaInitializeGlobals <<< tpb, bpg >>> ();
    cout << cudaGetErrorString(cudaGetLastError()) << '\n';

    cout << "Processing " << nEvents << " event(s)\n";
    uint linearIdx = 0;
    for (uint event = 0; event != nEvents; ++event)
    {
        uint nTracks = events[event].size();

        if (nTracks > MAX_NUMBER_OF_TRACKS)
        {
            cout << "Number of tracks in event " << event << " exceeds maximum number of tracks.\n";
            exit(1);
        }

        cudaCalcBeamIntersect <<< tpb, bpg >>> (dev_events + linearIdx, nTracks, bipThresh, binWidth);
        cout << cudaGetErrorString(cudaGetLastError()) << '\n';

    // Update linear index
        linearIdx += nTracks;
    }

cudacalcbeamintersect.cu

#include "vfind.cuh"

__device__ float    dev_zMin;
__device__ float    dev_zMax;
__device__ float    dev_zValues[MAX_NUMBER_OF_TRACKS];
__device__ uint     dev_histogram[MAX_NUMBER_OF_BINS];

__constant__ Track dev_beam = 
{
    {0, 0, 1},
    {0, 0, 0}
};

__global__ void cudaInitializeGlobals()
{
    uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
    uint const nThreads = blockDim.x * gridDim.x;

    if (tid == 0)
    {
        dev_zMin = 1e6;
        dev_zMax = -1e6;
    }

    uint idx = tid;
    while (idx < MAX_NUMBER_OF_BINS || idx < MAX_NUMBER_OF_TRACKS)          
    {
        if (idx < MAX_NUMBER_OF_BINS)
            dev_histogram[idx] = 0;

        if (idx < MAX_NUMBER_OF_TRACKS)
            dev_zValues[idx] = 0;

        idx += nThreads;
    }
}

__device__ float dot(float const v1[3], float const v2[3])
{
    // Stuff
}

__device__ float distance(Track const &t1, Track const &t2)
{
    // Even more boring unimportant stuff
}

__device__ Vertex vertex(Track const &t1, Track const &t2)
{
    // Yet even more boring unimportant stuff
}

__global__ void cudaCalcBeamIntersect(Track const *tracks, uint nTracks, float bipTresh, float binWidth)
{
    uint const tid = threadIdx.x + blockIdx.x * blockDim.x;
    uint const nThreads = blockDim.x * gridDim.x;

    uint idx = tid;
    while (idx < nTracks)
    {
        float dist = distance(tracks[idx], dev_beam);
        if (dist < bipTresh)
        {
            float z = vertex(tracks[idx], dev_beam).z;

            if (z < dev_zMin)
                atomicExch(&dev_zMin, z);

            if (z > dev_zMax)
                atomicExch(&dev_zMax, z);

            dev_zValues[idx] = z;
        }

        idx += nThreads;
    }

    __syncthreads();

    // To be continued here
}

Solution

@JorenHeit Your kernel cudaCalcBeamIntersect has global memory side effects and should not be getting optimized out. Based on the posted cuda-gdb output, it looks like the host thread that had launched the work is not waiting on the work to complete (via a cudaDeviceSynchronize() call or via a cudaMemcpy from device to host). As a result, the host thread is exiting before the cudaCalcBeamIntersect kernel could be executed on the GPU. Please try adding a cudaDeviceSynchronize() call after every kernel launch in your application.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow