Question

I have a project, written a few years ago, that computes N similar tasks sequentially on a single CPU core.

These N tasks are completely independent so they could be computed in parallel.

However, the problem with these tasks is that the control flow inside each task differs greatly from one task to another, so the SIMT approach implemented in CUDA is more likely to impede than to help.

I came up with the idea of launching N blocks with 1 thread each, to break the warp dependency between threads.

Can anyone suggest a better way to optimise the computations in this situation, or point out possible pitfalls with my solution?


The solution

You are right in your comment about what causes, and what is caused by, divergence of threads in a warp. However, the launch configuration you mention (1 thread per block) largely squanders the potential of the GPU. A warp (or half-warp, on older hardware) is the maximal unit of threads that is actually executed in parallel on a single multiprocessor. So having one thread per block across 32 blocks performs no better than having 32 threads in one warp taking different paths. In fact, the first case is worse, because the number of resident blocks per multiprocessor is quite limited (8 or 16, depending on compute capability).
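As a rough sketch of the difference between the two launch configurations (`taskKernel`, `process`, and the buffer names are hypothetical, not from the original question):

```cuda
// Hypothetical kernel: each thread handles one independent task.
__global__ void taskKernel(const Task *tasks, Result *results, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        results[i] = process(tasks[i]);  // divergent control flow lives here
}

// Proposed configuration: N blocks of 1 thread each.
// Every warp carries 1 active lane and 31 idle ones, and only
// 8-16 blocks can be resident per multiprocessor at a time.
taskKernel<<<N, 1>>>(d_tasks, d_results, N);

// Conventional configuration: full warps, several warps per block.
// Divergence can still serialize paths *within* a warp, but all 32
// lanes hold live work and occupancy is far higher.
taskKernel<<<(N + 255) / 256, 256>>>(d_tasks, d_results, N);
```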

Therefore, if you want to fully exploit the GPU's potential, keep Jack's comment in mind and try to reorganize the work so that the threads of a single warp follow the same execution path.
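One way to approach that reorganization, sketched as host-side preprocessing (the names `groupTasksByPath` and `pathKey` are hypothetical; `pathKey` stands in for whatever cheap classification predicts which branch a task will take):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical preprocessing step: given a per-task key describing the
// control-flow path each task is expected to take, return a permutation
// of task indices sorted by that key. Consecutive indices then map to
// consecutive threads, so the threads of one warp mostly share a path.
std::vector<int> groupTasksByPath(const std::vector<int>& pathKey) {
    std::vector<int> order(pathKey.size());
    for (size_t i = 0; i < order.size(); ++i)
        order[i] = static_cast<int>(i);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return pathKey[a] < pathKey[b]; });
    return order;
}
```

The kernel would then fetch its task through this permutation (e.g. `tasks[order[i]]`) instead of indexing tasks directly by thread index, so that warp-mates land on tasks with matching control flow.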

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow