Question

I have a dynamic memory allocation inside my kernel:

float *MyLongArray1 = new float[array_size];
float *MyLongArray2 = new float[array_size];

where array_size is taken from the kernel call. array_size is on the order of 100000, so quite large.

Memory allocation seems to work fine. Then I try to do something with both arrays:

for (int i = 0; i < array_size; i++)
{
    for (int j = 0; j < array_size; j++)
    {
        // do some calculations
    }
    MyLongArray1[i] = calculation_result1;
    MyLongArray2[i] = calculation_result2;
}

The code I've written works fine on 1 core and up to 15 cores. However, with 16 cores I get GPUassert: unspecified launch failure, even though cuda-memcheck still reports 0 errors. I have run some experiments: when I comment out MyLongArray2[i] = calculation_result2;, the code works again, and when I halve array_size, I can double the number of cores. It looks like dynamic allocation takes much more memory than expected? I am on Fermi with 3 GB of memory, so my arrays should fit into global memory fine.

What would be possible solutions in this case? Should I avoid dynamic memory allocation for CUDA applications?


Solution

In all likelihood, you're exceeding the size of the device heap, which is what in-kernel new and malloc allocate from and which defaults to only a few megabytes. You can raise the limit with a CUDA runtime API call:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, n*100000*sizeof(float));

Make sure you call this before the first kernel launch, though; the heap size cannot be changed afterwards. That said, you should strongly consider calling cudaMalloc once from the host to allocate a single large array instead of allocating per-thread inside the kernel: host-side allocations come from global memory directly and are not constrained by the device heap.
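A minimal sketch of the fix, assuming one thread block where each thread allocates two arrays with new[] (the kernel name my_kernel, the thread count, and the checks are illustrative, not from the original post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread allocates two arrays from the device heap.
__global__ void my_kernel(int array_size)
{
    float *a = new float[array_size];
    float *b = new float[array_size];
    if (a == nullptr || b == nullptr) {
        // In-kernel new returns nullptr when the device heap is exhausted;
        // dereferencing it produces "unspecified launch failure".
        delete[] a;
        delete[] b;
        return;
    }
    // ... calculations writing into a[] and b[] ...
    delete[] a;
    delete[] b;
}

int main()
{
    const int array_size = 100000;
    const int n_threads  = 16;  // each thread needs 2 * array_size floats

    // Reserve enough device-heap space for every thread's allocations,
    // BEFORE the first kernel launch in the process.
    size_t heap_bytes = (size_t)n_threads * 2 * array_size * sizeof(float);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);

    my_kernel<<<1, n_threads>>>(array_size);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Sizing the heap from the launch configuration, as above, avoids guessing a fixed constant; checking new for nullptr turns a silent launch failure into a recoverable condition.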

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow