Question

I have a kernel that uses about 2GB of local memory. A cudaMalloc that tries to allocate 2.5GB of memory fails if I run that kernel_func beforehand.

I found that 2GB of memory is still occupied after kernel_func has finished running, which leaves only 1.5GB for my cudaMalloc. Does anyone have a solution or explanation?

I know that using global memory for kernel_func would solve the problem, but for some reason I need to use local memory for that huge static array.

__global__ void kernel_func() {

    // The huge static array goes here
    short my_array[50000];

}

int main() {

    kernel_func<<<64, 128>>>();
    // my_array is still occupying memory at this point

    // This cudaMalloc will fail with insufficient memory
    cudaMalloc(/* 2.5GB data */);
}

Solution

If you allow kernel_func to finish (e.g. with cudaDeviceSynchronize()), then I doubt that my_array is still "occupying memory" as you suggest, after the kernel completes, i.e. at the point of this comment:

    // my_array is still occupying memory at this point

You could be more certain with a call to cudaMemGetInfo() at that point.
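As a sketch of that check (assuming the kernel launch from the question), you could print the free and total device memory right after the kernel has been allowed to finish:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // ... launch kernel_func<<<64, 128>>>() here ...
    cudaDeviceSynchronize();  // make sure the kernel has actually finished

    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("free: %zu MB, total: %zu MB\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

If the free figure here is close to what it was before the launch, my_array is not the culprit.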

Nevertheless, it's likely that what you're experiencing is memory fragmentation of some sort.

The only way I know of to "clean the slate" would be a call to cudaDeviceReset() at that point. However, that will kill any in-flight operations as well as any allocations on the GPU, so you should only do it when you have no other activity going on with the GPU, and you must re-allocate any GPU data that you need after the call to cudaDeviceReset().
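A minimal sketch of that "clean slate" sequence (the 2.5GB size is just the figure from the question):

```cuda
#include <cuda_runtime.h>

int main() {
    kernel_func<<<64, 128>>>();
    cudaDeviceSynchronize();  // let outstanding work finish first

    cudaDeviceReset();        // tears down the context: ALL prior device
                              // allocations and state are destroyed

    // Any device data still needed must be re-allocated (and re-copied)
    // after the reset; only now attempt the large allocation.
    void *big = nullptr;
    cudaError_t err = cudaMalloc(&big, 2500ULL << 20);  // ~2.5GB
    return (err == cudaSuccess) ? 0 : 1;
}
```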

Certainly if you can arrange your allocations using cudaMalloc instead, that might be easier.
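One hedged sketch of that rearrangement: give each thread a 50000-short slice of a single cudaMalloc'd buffer instead of a per-thread local array, then free it explicitly before the large allocation. The `scratch` parameter and the slice indexing are assumptions, not the asker's actual kernel:

```cuda
#include <cuda_runtime.h>

__global__ void kernel_func(short *scratch) {
    // Per-thread slice standing in for "short my_array[50000];"
    size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    short *my_array = scratch + tid * 50000;
    // ... use my_array as before ...
}

int main() {
    const size_t n_threads = 64 * 128;
    short *scratch = nullptr;
    cudaMalloc(&scratch, n_threads * 50000 * sizeof(short));

    kernel_func<<<64, 128>>>(scratch);
    cudaDeviceSynchronize();

    cudaFree(scratch);  // explicitly released, so a large cudaMalloc can follow
    return 0;
}
```

Because the scratch buffer is an ordinary allocation, its lifetime is under your control rather than tied to the driver's local-memory reservation.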

Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow