Question

I've got an already parallelized CUDA kernel that does some tasks which require frequent interpolation.

So there's a kernel

__global__ void complexStuff(...)

which calls one or more times this interpolation device function:

__device__ void interpolate(...)

The interpolation algorithm performs a WENO interpolation successively over three dimensions. This is a highly parallelizable task which I would urgently like to parallelize!

It is clear that the kernel complexStuff() can easily be parallelized by launching it from host code using the <<<...>>> syntax; indeed, it is already parallelized at that level.

But it's not clear to me how to parallelize work or create new threads from inside a CUDA device function. Is this even possible? Does anyone know?


Solution

You might want to consider Dynamic Parallelism (some resources here, here, and here), which lets you launch a CUDA kernel from inside another CUDA kernel. It requires a device of compute capability 3.5 or higher, and it comes with a number of restrictions and limitations that may degrade performance (mentioned in the third link).
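To make the idea concrete, here is a minimal sketch of dynamic parallelism. The kernel names, the work split, and the placeholder stencil are all assumptions for illustration, not the asker's actual code; it must be compiled with relocatable device code (e.g. nvcc -arch=sm_35 -rdc=true) and run on a device of compute capability 3.5 or higher.

```cuda
// Hypothetical child kernel: stands in for the fine-grained
// interpolation work (placeholder arithmetic, not real WENO).
__global__ void interpolateChild(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5f * (in[i] + in[(i + 1) % n]);  // placeholder stencil
}

// Parent kernel: launches a child grid from device code.
__global__ void complexStuff(const float *in, float *out, int n)
{
    // ... coarse-grained work per parent thread ...

    // Guard the launch so only one thread per block spawns a child
    // grid; otherwise every parent thread launches its own grid.
    if (threadIdx.x == 0) {
        interpolateChild<<<(n + 255) / 256, 256>>>(in, out, n);
    }
}
```

Note that child launches are queued per parent block and have their own scheduling overhead; launching a child grid from every parent thread can easily cost more than it saves, which is part of why the restrictions mentioned above matter.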
My suggestion is to first consider launching your CUDA kernel with the amount of work of complexStuff(...) multiplied by the amount of work of interpolate(...). In other words, statically estimate the maximum number of fine-grained parallel jobs you need to do, then configure your kernel so that block threads perform those fine-grained jobs. Note that this is just speculation without seeing your program code.
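The flattening suggested above can be sketched as follows. COARSE_JOBS, FINE_JOBS, and the index arithmetic are illustrative assumptions: one thread is mapped to one fine-grained interpolation job, so a single host-side launch covers both levels of parallelism.

```cuda
// Hedged sketch: the two levels of work are multiplied into one grid.
#define COARSE_JOBS 1024   // e.g. number of complexStuff work items
#define FINE_JOBS   64     // e.g. interpolation points per work item

__global__ void complexStuffFlat(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= COARSE_JOBS * FINE_JOBS) return;

    int coarse = tid / FINE_JOBS;  // which coarse-grained job
    int fine   = tid % FINE_JOBS;  // which interpolation point within it

    // Each thread performs exactly one fine-grained job
    // (placeholder arithmetic standing in for the real interpolation).
    out[tid] = 0.5f * (in[coarse * FINE_JOBS + fine]
                     + in[coarse * FINE_JOBS]);
}

// Host side: one launch sized for the total fine-grained work.
// int total = COARSE_JOBS * FINE_JOBS;
// complexStuffFlat<<<(total + 255) / 256, 256>>>(d_in, d_out);
```

This avoids dynamic parallelism entirely, at the cost of having to bound the fine-grained job count statically and waste threads when some coarse jobs need fewer interpolation points than others.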

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow