Question
I have seen reduction algorithms in CUDA (such as summation and maximization over a range of elements) discussed in previous posts, but with dynamic parallelism they could potentially be implemented in a different way. Is there a more efficient implementation that is callable from inside kernels?
Solution
CUB provides a CUDA reduction primitive that is compatible with dynamic parallelism, meaning it can be called from within kernels.
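As a rough illustration of what such a call might look like, here is a minimal sketch that invokes `cub::DeviceReduce::Sum` from inside a kernel. It rests on several assumptions that are not part of the answer above: a CUB version and toolkit that still support device-side launches, compilation with relocatable device code (for example `nvcc -rdc=true example.cu -lcudadevrt`), and a temporary-storage size queried on the host that is also valid for the device-side call. The names `parentKernel`, `d_temp`, and `temp_bytes` are made up for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cub/cub.cuh>

// Hypothetical parent kernel: a single thread launches the CUB reduction
// as child work via dynamic parallelism. All pointers refer to global
// memory, since child kernels cannot read the parent's local or shared memory.
__global__ void parentKernel(const int* d_in, int* d_out, int num_items,
                             void* d_temp, size_t temp_bytes)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // temp_bytes is taken by reference; the kernel parameter is a local lvalue.
        cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    }
}

int main()
{
    const int num_items = 1 << 20;
    std::vector<int> h_in(num_items, 1);  // all ones, so the sum should equal num_items

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);

    // First call with a null workspace only reports the required size.
    // Assumption: the size computed here on the host also suffices on the device.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);

    // The parent grid is not complete until its child grids finish,
    // so a host-side synchronize covers the nested reduction as well.
    parentKernel<<<1, 1>>>(d_in, d_out, num_items, d_temp, temp_bytes);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("sum = %d (expected %d)\n", h_out, num_items);

    cudaFree(d_temp);
    cudaFree(d_out);
    cudaFree(d_in);
    return h_out == num_items ? 0 : 1;
}
```

Whether this beats a hand-written reduction depends on the problem size and on launch overhead, so it is worth profiling. If the data to reduce already lives within a single thread block, a block-level primitive such as `cub::BlockReduce` avoids the nested launch altogether.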