Question

I ran some CUDA code that updates an array of floats. I have a wrapper function like the one discussed in this question: How can I compile CUDA code then link it to a C++ project?

Inside my CUDA function I create a for loop like this...

int tid = threadIdx.x;
for(int i=0;i<X;i++)
{
     //code here
}

Now the issue is that if X equals 100, everything works just fine, but if X equals 1,000,000, my vector does not get updated (almost as if the code inside the for loop never executes).

However, if I instead call the CUDA function from a for loop inside the wrapper function, it works fine (although for some reason it is significantly slower than doing the same work on the CPU), like this...

for(int i=0;i<1000000;i++)
{
      update<<<NumObjects,1>>>(dev_a, NumObjects);
}

Does anyone know why I can loop a million times in the wrapper function, but not call the CUDA "update" function once and run a loop of a million iterations inside it?


Solution

You should be calling cudaThreadSynchronize (cudaDeviceSynchronize on newer toolkits) and cudaGetLastError after running the kernel to see whether there was an error. I suspect that in the first case the kernel timed out: this happens when a kernel takes too long to complete, and the card simply gives up on it.
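As a rough sketch, that check inside the wrapper could look like the following (dev_a, NumObjects and update are the names from the question; the error-handling pattern itself is just one common way to do it):

update<<<NumObjects,1>>>(dev_a, NumObjects);

// Catch launch-configuration errors first, then execution errors.
// A watchdog timeout shows up as an error returned by the synchronize call.
cudaError_t err = cudaGetLastError();
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();   // cudaThreadSynchronize() on older toolkits
if (err != cudaSuccess)
    printf("update kernel failed: %s\n", cudaGetErrorString(err));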

The second thing, the reason it takes much longer to execute, is that there is a fixed overhead for every kernel launch. When the loop was inside the kernel, you paid that overhead once and then ran the whole loop. Now you are paying it a million times, once per launch. The overhead is fairly small, but large enough that as much of the loop as possible should be put inside the kernel.

If X is particularly large, look into running as much of the loop inside the kernel as it can complete in a safe amount of time, and then looping over those kernel launches.
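A minimal sketch of that chunking idea, assuming the update kernel is modified to take a start index and a count (those two parameters and the CHUNK value are illustrative, not part of the original code):

const int CHUNK = 100000;   // illustrative: small enough to finish well before any watchdog timeout
for (int start = 0; start < 1000000; start += CHUNK)
{
    // Each launch runs CHUNK iterations of the loop inside the kernel,
    // so the launch overhead is paid 10 times instead of 1,000,000 times.
    update<<<NumObjects,1>>>(dev_a, NumObjects, start, CHUNK);
    cudaDeviceSynchronize();   // also a convenient place to check for errors
}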
