Question

I'm just learning OpenCL, and I'm at the point where I'm trying to launch a kernel. Why are GPU threads managed in a grid?

I'm going to read about this in more detail, but a simple explanation would be nice. Is it always like this when working with GPGPUs?


Solution

This is a common approach, used in CUDA, OpenCL, and (I believe) ATI Stream.

The idea behind the grid is to provide a simple, but flexible, mapping between the data being processed and the threads doing the data processing. In the simple version of the GPGPU execution model, one GPU thread is "allocated" for each output element in a 1D, 2D or 3D grid of data. To process this output element, the thread will read one (or more) elements from the corresponding location or adjacent locations in the input data grid(s). By organizing the threads in a grid, it's easier for the threads to figure out which input data elements to read and where to store the output data elements.
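As a rough illustration of that mapping, here is a minimal OpenCL C kernel sketch. The kernel name, arguments, and the scaling operation are all invented for the example; the point is that it would be launched over a 2D grid of work-items, one per output element, and each work-item uses its grid coordinates to locate the input element it reads and the output element it writes.

```c
// Hypothetical OpenCL C kernel: one work-item per output element of a
// 2D grid. Each work-item reads the corresponding input element and
// writes exactly one output element.
__kernel void scale_image(__global const float *input,
                          __global float *output,
                          const int width,
                          const int height,
                          const float gain)
{
    int x = get_global_id(0);   // column index in the grid
    int y = get_global_id(1);   // row index in the grid

    if (x < width && y < height) {          // guard against padded launch sizes
        int idx = y * width + x;            // flatten 2D coordinates to a linear offset
        output[idx] = input[idx] * gain;    // this work-item's single output element
    }
}
```

The bounds check is there because the launch size is often rounded up to a multiple of the work-group size, so a few work-items may fall outside the data grid.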

This contrasts with the common multi-core CPU threading model, where one thread is allocated per CPU core and each thread processes many input and output elements (e.g. 1/4 of the data on a quad-core system).
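For contrast, here is a minimal sketch of that CPU-style model using pthreads; the thread count, data size, and the doubling operation are arbitrary assumptions. Each thread walks an entire contiguous chunk of the array rather than handling a single element.

```c
/* CPU-style threading sketch: one thread per (assumed) core, each
 * processing a contiguous quarter of the data. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define DATA_SIZE   1024

static float input[DATA_SIZE];
static float output[DATA_SIZE];

typedef struct { int begin; int end; } chunk_t;

static void *worker(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    for (int i = c->begin; i < c->end; ++i)   /* many elements per thread */
        output[i] = input[i] * 2.0f;
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    chunk_t chunks[NUM_THREADS];
    const int per_thread = DATA_SIZE / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; ++t) {
        chunks[t].begin = t * per_thread;
        chunks[t].end   = (t + 1) * per_thread;
        pthread_create(&threads[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < NUM_THREADS; ++t)
        pthread_join(&threads[t], NULL);

    printf("processed %d elements with %d threads\n", DATA_SIZE, NUM_THREADS);
    return 0;
}
```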

OTHER TIPS

The simple answer is that GPUs are designed to process images and textures that are 2D grids of pixels. When you render a triangle in DirectX or OpenGL, the hardware rasterizes it into a grid of pixels.

I will invoke the classic analogy of putting a square peg in a round hole. In this case, the GPU is a very square hole, not as well rounded as "GP" (general purpose) would suggest.

The explanations above put forward the idea of 2D textures, etc. The GPU's architecture is such that all processing is done in streams, with the pipeline identical in each stream, so the data being processed needs to be segmented accordingly.

One reason this is a nice API is that you are typically working with an algorithm that has several nested loops. If you have one, two, or three nested loops, then a grid of one, two, or three dimensions maps nicely onto the problem, giving you a thread for each index value.

So the values you need in your kernel (the index values) are expressed naturally in the API, as the sketch below shows.
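As a sketch of that loop-to-grid mapping (the array names and the addition are just placeholders), the sequential nested loops in the comment and the kernel below compute the same result; launching the kernel over a 2D NDRange of (width, height) work-items makes both loop indices available through get_global_id, so only the loop body survives.

```c
/* Sequential reference, two nested loops:
 *
 *   for (int y = 0; y < height; ++y)
 *       for (int x = 0; x < width; ++x)
 *           c[y * width + x] = a[y * width + x] + b[y * width + x];
 *
 * The kernel below replaces both loops: each (x, y) index pair gets
 * its own work-item. */
__kernel void add2d(__global const float *a,
                    __global const float *b,
                    __global float *c,
                    const int width,
                    const int height)
{
    int x = get_global_id(0);   /* what was the inner-loop index */
    int y = get_global_id(1);   /* what was the outer-loop index */

    if (x < width && y < height)
        c[y * width + x] = a[y * width + x] + b[y * width + x];
}
```

On the host side, the loop bounds become the launch dimensions: something like `size_t global[2] = {width, height};` passed as the global work size to clEnqueueNDRangeKernel.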

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow