Question

There is very little information on Kepler's dynamic parallelism. From the description of this new technology, does it mean that the issue of thread control-flow divergence within the same warp is solved?

It allows recursion and launching kernels from device code. Does this mean that control paths in different threads can be executed simultaneously?

Solution

Take a look at this paper.

Dynamic parallelism, flow divergence and recursion are separate concepts. Dynamic parallelism is the ability to launch new kernels (grids of threads) from within a running kernel. This means, for example, that you may do this:

__global__ void t_father(...)   {
   ...
   t_child<<<BLOCKS, THREADS>>>();   // device-side launch of a child grid
   ...
}

I personally investigated this area. When you do something like this, i.e. when t_father launches t_child, the whole GPU's resources are redistributed among them, and t_father waits until all the t_child grids have finished before it can go on (see also Slide 25 of this paper).
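For reference, here is a self-contained sketch of that pattern (the t_father/t_child names follow the snippet above; the grid and block sizes are arbitrary choices of mine). Dynamic parallelism needs compute capability 3.5+ and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true parent_child.cu -lcudadevrt:

#include <cstdio>

__global__ void t_child(int parentBlock)
{
   // each child thread just reports where it came from
   printf("child of parent block %d: block %d, thread %d\n",
          parentBlock, blockIdx.x, threadIdx.x);
}

__global__ void t_father()
{
   // launch the child grid from one thread per block, so we don't fire
   // one child grid per parent thread
   if (threadIdx.x == 0)
      t_child<<<2, 4>>>(blockIdx.x);   // device-side launch
   // child grids launched here are guaranteed to have completed by the
   // time this parent grid exits
}

int main()
{
   t_father<<<2, 32>>>();
   cudaDeviceSynchronize();   // host waits for the parent (and thus all children)
   return 0;
}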

Recursion has been available since Fermi and is the ability for a thread to call a function recursively, without any other thread/block reconfiguration.
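For comparison, plain recursion needs no dynamic parallelism at all. A minimal sketch (the factorial example is my own, not from the answer; it requires CC 2.0+, i.e. Fermi):

__device__ int factorial(int n)
{
   return (n <= 1) ? 1 : n * factorial(n - 1);   // ordinary device-side recursion
}

__global__ void fill_factorials(int *out)
{
   // each thread recurses independently on its own stack
   out[threadIdx.x] = factorial(threadIdx.x + 1);
}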

Regarding flow divergence, I guess we will never see threads within a warp executing different code simultaneously.

Other tips

No. The warp concept still exists. All the threads in a warp are SIMD (Single Instruction, Multiple Data), which means that at any given time they run the same instruction. Even when you call a child kernel, the GPU assigns one or more warps to your call. Keep three things in mind when you're using dynamic parallelism:

  1. The deepest nesting level you can reach is 24 (on CC 3.5).

  2. The number of dynamic kernels running at the same time is limited (default 4096), but it can be increased (see the sketch after this list).

  3. Keep the parent kernel busy after the child kernel call; otherwise, there's a good chance you will waste resources.
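Points 1 and 2 correspond to device-runtime limits that the host can query and raise before launching the parent kernel. A minimal sketch using the standard runtime calls (the raised values are arbitrary examples of mine):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
   size_t pending = 0, syncDepth = 0;

   // query the current device-runtime limits
   cudaDeviceGetLimit(&pending,   cudaLimitDevRuntimePendingLaunchCount);
   cudaDeviceGetLimit(&syncDepth, cudaLimitDevRuntimeSyncDepth);
   printf("pending launches: %zu, sync depth: %zu\n", pending, syncDepth);

   // raise how many device-side launches the runtime can buffer at once...
   cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);
   // ...and the nesting depth at which a parent may synchronize on its children
   cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);

   return 0;
}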

There's a sample CUDA source in this NVIDIA presentation on slide 9.

__global__ void convolution(int x[])
{
   for (int j = 1; j <= x[blockIdx.x]; ++j)
      kernel<<< ... >>>(blockIdx.x, j);
}

It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial compute domains of various sizes (slide 14).

The global compute domain and its partitioning are still static, so you can't actually change this DURING GPU computation to, e.g., spawn more kernel executions because you haven't reached the end of your evaluation function yet. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.
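To make that concrete, here is a hedged sketch of how such an array could be prepared on the host and consumed by the slide's convolution kernel (the names convolution, kernel and x come from the slide; the per-block counts, launch sizes and the empty child body are placeholders of mine). It builds with the same -rdc=true setup as the earlier sketch:

#include <cuda_runtime.h>

__global__ void kernel(int parentBlock, int j)
{
   // ... work for sub-task j of parent block parentBlock ...
}

__global__ void convolution(int x[])
{
   // each block spawns as many child grids as the host decided beforehand
   for (int j = 1; j <= x[blockIdx.x]; ++j)
      kernel<<<1, 64>>>(blockIdx.x, j);
}

int main()
{
   const int nBlocks = 4;
   int h_x[nBlocks] = {2, 5, 1, 3};   // child-launch counts, fixed before the GPU runs

   int *d_x;
   cudaMalloc(&d_x, nBlocks * sizeof(int));
   cudaMemcpy(d_x, h_x, nBlocks * sizeof(int), cudaMemcpyHostToDevice);

   convolution<<<nBlocks, 1>>>(d_x);
   cudaDeviceSynchronize();

   cudaFree(d_x);
   return 0;
}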

License: CC BY-SA (attribution)