CUDA Convex Hull program crashes on large input

https://stackoverflow.com/questions/7102166

17-12-2020

Question

I am trying to implement the QuickHull algorithm (for convex hulls) in parallel in CUDA. It works correctly for input_size <= 1 million. When I try 10 million points, the program crashes. My graphics card has 1982 MB of memory, and all the data structures in my algorithm together require no more than 600 MB for this input size, which is less than 50% of the available space.

By commenting out lines of my kernels, I found that the crash occurs when I access an array element, even though the index of that element is not out of bounds (I double-checked). The following is the kernel code where it crashes.

for(unsigned int i = old_setIndex; i < old_setIndex + old_setS[tid]; i++) 
{

    int pI = old_set[i];
    if(pI < 0 || pI >= pts.size())  // valid indices are 0 .. pts.size() - 1
    {
        printf("Thread %d: i = %u, pI = %d\n", tid, i, pI);  // i is unsigned, so use %u
        continue;
    }
    p = pts[pI];

    double d = distance(A,B,p);

    if(d > dist) {
        dist = d;
        furthestPoint = i;
        fpi = pI;
    }
}
//fpi = old_set[furthestPoint]; 
//printf("Thread %d: Furthestpoint = %d\n", tid, furthestPoint);

My code crashes when I uncomment the statements after the for loop (the array access and the printf). I cannot explain the error, because furthestPoint is always within the bounds of the old_set array. old_setS stores the sizes of the smaller arrays that each thread operates on. It crashes even if I only print the value of furthestPoint (last line) without the array access statement above it.

There is no problem with the above code for input size <= 1 million. Am I overflowing some buffer on the device with 10 million points?

Please help me in finding the source of the crash.

Solution

There is no out of bounds memory access in your code (or at least not one which is causing the symptoms you are seeing).

What is happening is that your kernel is being killed by the display driver because it takes too long to execute on your display GPU. All CUDA platform display drivers include a time limit for any operation on the GPU. This exists to prevent the display from freezing for so long that either the OS kernel panics or the user panics and thinks the machine has crashed. On the Windows platform you are using, the time limit is about 2 seconds.
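One way to confirm this diagnosis is to ask the runtime whether the device has a run-time limit, and to check the error code after the kernel finishes. The following is a minimal sketch, not the asker's code: the helper names `hasKernelTimeout` and `kernelCompleted` are made up for illustration, while `cudaGetDeviceProperties`, the `kernelExecTimeoutEnabled` field, and `cudaErrorLaunchTimeout` come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the driver enforces a run-time limit
// on kernels for this device (usually the case when the GPU drives a display).
bool hasKernelTimeout(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;  // the query failed, so nothing can be said about the device
    return prop.kernelExecTimeoutEnabled != 0;
}

// Hypothetical helper: call right after a kernel launch. It waits for the
// kernel to finish and reports whether it completed successfully.
bool kernelCompleted()
{
    cudaError_t err = cudaGetLastError();   // errors detected at launch time
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // errors raised while the kernel ran
    if (err == cudaErrorLaunchTimeout) {
        fprintf(stderr, "Kernel was killed by the display driver time limit\n");
        return false;
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

If `hasKernelTimeout(0)` returns true and the failing launch reports `cudaErrorLaunchTimeout`, the crash is very likely the time limit rather than a memory error.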

What has partly misled you into thinking the source of the problem is array addressing is that commenting out code makes the problem disappear. But what really happens there is an artifact of compiler optimization. When you comment out a global memory write, the compiler recognizes that the calculations leading to the stored value are unused, and it removes all of that code from the assembler code it emits (google "nvcc dead code removal" for more information). That makes the code run much faster and brings it under the display driver time limit.
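The effect can be illustrated with two nearly identical kernels. This is a sketch under assumptions: the kernel names, the host helper `maxOnDevice`, and the max-search loop are invented for illustration and only loosely mirror the furthest-point loop in the question. Error checking is omitted to keep it short.

```cuda
#include <cuda_runtime.h>

// The loop result is never written anywhere, so the compiler is free to
// remove the whole loop. The kernel then does almost no work.
__global__ void withoutStore(const float* in, int n)
{
    float best = 0.0f;
    for (int i = 0; i < n; ++i)
        if (in[i] > best) best = in[i];
    // 'best' is discarded here
}

// Storing the result to global memory makes the loop observable, so the
// compiler must keep it and the kernel pays the full cost of the loop.
__global__ void withStore(const float* in, int n, float* out)
{
    float best = 0.0f;
    for (int i = 0; i < n; ++i)
        if (in[i] > best) best = in[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = best;
}

// Hypothetical host helper: runs withStore on a single thread and returns
// the value it wrote (error checking omitted for brevity).
float maxOnDevice(const float* hostData, int n)
{
    float *dIn = nullptr, *dOut = nullptr, result = 0.0f;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    withStore<<<1, 1>>>(dIn, n, dOut);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Timing `withoutStore` against `withStore` on the same input should show the gap. In the question, a printf of furthestPoint plays the same role as the store: it makes the loop's result observable, so the compiler has to keep the loop.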

For workarounds, see this recent Stack Overflow question and answer.
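One commonly used workaround, which may or may not be the one discussed in the linked answer, is to split the work into several shorter kernel launches so that each one finishes well inside the time limit. The sketch below assumes the work can be divided by element range. `processChunk`, `processInChunks`, and the doubling operation are placeholders for illustration, not the QuickHull kernel.

```cuda
#include <cuda_runtime.h>

// Placeholder per-element work standing in for the real kernel body.
__global__ void processChunk(const float* in, float* out, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        out[offset + i] = in[offset + i] * 2.0f;
}

// Processes elements [0, n) in pieces of at most chunkSize elements
// (chunkSize must be positive). Each piece is its own launch, and the host
// waits for it to finish before starting the next one.
cudaError_t processInChunks(const float* dIn, float* dOut, int n, int chunkSize)
{
    const int threads = 256;
    int offset = 0;
    while (offset < n) {
        int remaining = n - offset;
        int count = remaining < chunkSize ? remaining : chunkSize;
        int blocks = (count + threads - 1) / threads;
        processChunk<<<blocks, threads>>>(dIn, dOut, offset, count);
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            return err;  // includes cudaErrorLaunchTimeout if a chunk is still too slow
        offset += count;
    }
    return cudaSuccess;
}
```

The chunk size is a tuning parameter: it should be small enough that one launch stays far below the limit, and large enough to keep the GPU busy.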

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow