Question

I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.

I am currently working on a massive parallelization of a code initially written in C. The reason I became interested in CUDA is the large size of the arrays I am dealing with: the code is a fluid mechanics simulation, and I need to run a "time loop" with five to six successive operations on arrays as big as 3×10^9 or 19×10^9 double variables. I went through various tutorials and documentation and finally managed to write a not-so-bad CUDA code.

Without going into the details of the code, I use relatively small 2D blocks. The number of threads per block is 18 or 57 (which is awkward, since my warps are not fully occupied).

The kernels are launched on a "big" 3D grid that describes my physical geometry (the maximum desired size is 1,000 values per dimension, which means I want to deal with a 3D grid of one billion blocks).
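Schematically, my launch configuration looks something like this (the kernel and variable names here are made-up placeholders, not my actual code):

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for one of my real ones.
    __global__ void time_step(double *field, int nx, int ny, int nz) { }

    int main(void)
    {
        double *d_field = NULL;          // device array (allocation omitted here)
        int nx = 1000, ny = 1000, nz = 1000;

        dim3 block(19, 3);               // 19 x 3 = 57 threads per block,
                                         // not a multiple of the 32-thread warp size
        dim3 grid(nx, ny, nz);           // one block per cell: 10^9 blocks

        time_step<<<grid, block>>>(d_field, nx, ny, nz);
        cudaDeviceSynchronize();
        return 0;
    }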

Okay, so now, my five to six kernels, which do the job correctly, make good use of shared memory, since global memory is read once and written once per kernel (the size of my blocks was actually chosen according to the amount of shared memory needed).
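To illustrate the read-once/write-once pattern, here is a generic sketch (not my actual kernel; the names and the trivial operation are made up, and it is sized for a 19×3 block):

    __global__ void scale(const double *in, double *out, int nx, int ny)
    {
        __shared__ double tile[3][19];            // fits the 19 x 3 thread block

        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        bool inside = (x < nx) && (y < ny);

        if (inside)
            tile[threadIdx.y][threadIdx.x] = in[y * nx + x];   // one global read
        __syncthreads();                          // every thread reaches the barrier

        if (inside)
            out[y * nx + x] = 0.5 * tile[threadIdx.y][threadIdx.x];   // one global write
    }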

Some of my kernels are launched concurrently, called asynchronously, but most of them need to run one after the other. There are several memcpys from device to host, but the ratio of memcpy calls to kernel calls is quite low. I am mostly executing operations on my array values.
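The concurrent part relies on streams, along these lines (a simplified sketch; the kernel names, arrays, and stream count are made up):

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in different streams may overlap...
    kernelA<<<grid, block, 0, s1>>>(d_a);
    kernelB<<<grid, block, 0, s2>>>(d_b);

    // ...while dependent kernels stay in one stream and run in order.
    kernelC<<<grid, block, 0, s1>>>(d_a);

    // The occasional copy back to the host, also asynchronous
    // (h_a must be pinned memory, e.g. from cudaMallocHost, for the
    // copy to actually overlap with computation).
    cudaMemcpyAsync(h_a, d_a, nbytes, cudaMemcpyDeviceToHost, s1);
    cudaStreamSynchronize(s1);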

Here is my question:

If I understood correctly, all of my blocks work on the arrays at the same time. Does that mean that a 10-block grid, a 100-block grid, or a billion-block grid will take the same amount of time? The answer is obviously no, since the computation time is significantly longer when I deal with large grids. Why is that?

I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.

Since I went through all the optimization and CUDA programming advice/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...

Thanks!


Solution

If I understood correctly, all of my blocks work on the arrays at the same time.

No, they don't all run at the same time! How many thread blocks can run concurrently depends on several things, all of which depend on the compute capability of your device - the NVS 5200M should be cc 2.1. A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block, and each warp within a block, will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.

Every SM has a limited amount of resources - shared memory and registers, for example. The Programming Guide and the Occupancy Calculator give a good overview of these limits. The first limit is that on cc 2.1 an SM can run at most 8 thread blocks at the same time. Depending on your usage of registers, shared memory, and so on, that number may decrease.
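If you want to check these limits on your own device, here is a minimal sketch (my_kernel is a stand-in for one of your kernels; the occupancy call assumes CUDA 6.5 or newer):

    #include <stdio.h>

    // Stand-in for one of your real kernels.
    __global__ void my_kernel(double *a) { }

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs: %d, max threads per SM: %d\n",
               prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);

        // How many blocks of 57 threads are resident per SM for this kernel
        // (API available since CUDA 6.5).
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, 57, 0);
        printf("resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }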

If I remember correctly, an SM of cc 2.1 consists of 48 CUDA cores, so with its 96 cores your NVS 5200M should have two SMs. Let's assume that with your kernel setup, N (N <= 8) thread blocks fit on one SM at the same time, so up to 2N blocks are resident on the whole device. The internal scheduler launches that many blocks and queues up all the other thread blocks. When one thread block has finished its work, the next one from the queue is launched. So as long as the total number of blocks you launch does not exceed the number that can be resident at once, the kernel time will be roughly constant. Launch even one block more than that, and the time increases.
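You can see this behaviour yourself by timing the same kernel with a growing number of blocks, e.g. with CUDA events (a self-contained sketch with a made-up busy-work kernel):

    #include <stdio.h>

    __global__ void dummy(double *a)      // stand-in for one of the real kernels
    {
        // Busy work so the kernel takes measurable time; all blocks touch
        // the same 57 elements, which is fine for a timing experiment.
        double v = a[threadIdx.x];
        for (int i = 0; i < 100000; ++i) v = v * 1.0000001 + 1e-9;
        a[threadIdx.x] = v;
    }

    int main(void)
    {
        double *d_a;
        cudaMalloc(&d_a, 57 * sizeof(double));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        for (int nblocks = 1; nblocks <= 32; ++nblocks) {
            cudaEventRecord(start);
            dummy<<<nblocks, 57>>>(d_a);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%2d blocks: %6.3f ms\n", nblocks, ms);
        }
        cudaFree(d_a);
        return 0;
    }

The time should stay roughly flat up to the number of blocks that are resident at once, then jump as the scheduler needs an extra "wave" of blocks.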
