Question

Let me take hardware with compute capability 1.3 as an example.

30 SMs are available, so at most 240 blocks can be resident at the same time (given the limits on registers and shared memory, the actual number of resident blocks may be much lower). Blocks beyond those 240 have to wait for hardware resources to become available.

My question is when the blocks beyond 240 get assigned to SMs: as soon as some of the first 240 blocks complete, or only after all of the first 240 blocks have finished?
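
For reference, a minimal sketch of where the 240 figure comes from (assuming the compute capability 1.x hardware limit of 8 resident blocks per SM, which register and shared memory usage can only lower):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int maxResidentBlocksPerSM = 8;  // hardware limit for compute capability 1.x
    int upperBound = prop.multiProcessorCount * maxResidentBlocksPerSM;

    printf("%d SMs x %d blocks/SM = at most %d resident blocks\n",
           prop.multiProcessorCount, maxResidentBlocksPerSM, upperBound);
    return 0;
}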

I wrote the following piece of code:

#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

const int BLOCKNUM = 1024;
const int N = 240;

// Block 0 spins until block N has raised its flag; every block raises its
// own flag once it runs. If block N cannot be scheduled before block 0
// retires, the kernel deadlocks.
__global__ void kernel ( volatile int* mark ) {
    if ( blockIdx.x == 0 ) while ( mark[N] == 0 );
    if ( threadIdx.x == 0 ) mark[blockIdx.x] = 1;
}

int main() {
    int * mark;
    cudaMalloc ( ( void** ) &mark, sizeof ( int ) * BLOCKNUM );
    cudaMemset ( mark, 0, sizeof ( int ) * BLOCKNUM );
    kernel <<< BLOCKNUM, 1 >>> ( mark );
    cudaFree ( mark );   // waits for the kernel, so the hang shows up here
    return 0;
}

This code deadlocks and never terminates. But if I change N from 240 to 239, it terminates. So I would like to know some details about how blocks are scheduled.


Solution

On GT200, micro-benchmarking has demonstrated that new blocks are scheduled whenever an SM has retired all of the blocks it was currently running. So the answer is: when some blocks are finished, and the scheduling granularity is the SM level. There seems to be a consensus that Fermi GPUs have a finer scheduling granularity than earlier generations of hardware.
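
As a rough illustration (my sketch, not taken from any micro-benchmark paper), you can have each block record the SM it ran on via the %smid special register together with a clock() timestamp; grouping the output by SM shows in which order blocks landed on each SM:

// Each block records which SM it executed on and a raw start timestamp.
// %smid is an informational register and clock() is a per-SM counter, so
// treat the numbers as a rough picture rather than exact timings.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void whereAmI(unsigned int *smid, unsigned int *start) {
    if (threadIdx.x == 0) {
        unsigned int id;
        asm("mov.u32 %0, %%smid;" : "=r"(id));
        smid[blockIdx.x]  = id;
        start[blockIdx.x] = (unsigned int)clock();
    }
}

int main() {
    const int blocks = 1024;
    unsigned int *dSmid, *dStart;
    cudaMalloc((void**)&dSmid,  blocks * sizeof(unsigned int));
    cudaMalloc((void**)&dStart, blocks * sizeof(unsigned int));

    whereAmI<<<blocks, 32>>>(dSmid, dStart);
    cudaDeviceSynchronize();

    unsigned int hSmid[blocks], hStart[blocks];
    cudaMemcpy(hSmid,  dSmid,  sizeof(hSmid),  cudaMemcpyDeviceToHost);
    cudaMemcpy(hStart, dStart, sizeof(hStart), cudaMemcpyDeviceToHost);

    for (int i = 0; i < blocks; ++i)
        printf("block %d: SM %u, start %u\n", i, hSmid[i], hStart[i]);

    cudaFree(dSmid);
    cudaFree(dStart);
    return 0;
}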

Other tips

I can't find any reference about this for compute capabilities < 1.3.

The Fermi architecture introduces a new block dispatcher called the GigaThread engine.
GigaThread enables immediate replacement of blocks on an SM when one finishes executing, and it also enables concurrent kernel execution.
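
A minimal sketch (my own, with a placeholder kernel) of the concurrent kernel execution GigaThread enables: two independent kernels launched into separate streams may overlap on Fermi-class hardware, provided the first launch does not fill the whole device.

#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel that just burns some time on its part of the data.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)
            v = v * 0.999f + 0.001f;
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 16;
    float *a, *b;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Small grids leave SMs free, so the second launch can start
    // before the first one has finished.
    busyKernel<<<16, 256, 0, s1>>>(a, n);
    busyKernel<<<16, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}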

While there is no official answer to this, you can measure through atomic operations when your blocks begin their work and when they end.

Try playing with the following code:

#include <stdio.h>

const int maxBlocks=60; //Number of blocks of size 512 threads on current device required to achieve full occupancy

__global__ void emptyKernel() {}


__global__ void myKernel(int *control, int *output) {
        if (threadIdx.x==1) {
                //register that we enter
                int enter=atomicAdd(control,1);
                output[blockIdx.x]=enter;

                //some intensive and long task
                int &var=output[blockIdx.x+gridDim.x]; //var references global memory
                var=1;
                for (int i=0; i<12345678; ++i) {
                        var+=1+tanhf(var);
                }

                //register that we quit
                var=atomicAdd(control,1);
        }
}


int main() {

        int *gpuControl;
        cudaMalloc((void**)&gpuControl, sizeof(int));
        int cpuControl=0;
        cudaMemcpy(gpuControl,&cpuControl,sizeof(int),cudaMemcpyHostToDevice);


        int *gpuOutput;
        cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks*2);
        int cpuOutput[maxBlocks*2];

        for (int i=0; i<maxBlocks*2; ++i) //clear the host array just to be on the safe side
                cpuOutput[i]=-1;

        // play with these values
        const int thr=479;
        const int p=13;
        const int q=maxBlocks;

        //I found that this may actually affect the scheduler! Try with and without this call.
        emptyKernel<<<p,thr>>>();

        cudaEvent_t timerStart;
        cudaEvent_t timerStop;
        cudaEventCreate(&timerStart);
        cudaEventCreate(&timerStop);

        cudaThreadSynchronize();

        cudaEventRecord(timerStart,0);

        myKernel<<<q,512>>>(gpuControl, gpuOutput);

        cudaEventRecord(timerStop,0);
        cudaEventSynchronize(timerStop);

        cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*maxBlocks*2,cudaMemcpyDeviceToHost);

        cudaThreadSynchronize();
        float thisTime;
        cudaEventElapsedTime(&thisTime,timerStart,timerStop);

        cudaEventDestroy(timerStart);
        cudaEventDestroy(timerStop);
        printf("Elapsed time: %f\n",thisTime);

        for (int i=0; i<q; ++i)
                printf("%d: %d-%d\n",i,cpuOutput[i],cpuOutput[i+q]);
}

What you get in the output is the block ID, followed by the enter "time" and exit "time". This way you can learn in which order those events occurred.

On Fermi, I'm sure that a block is scheduled on an SM as soon as there is room for it. That is, whenever an SM finishes executing one block, it will execute another block if there are any blocks left. (However, the actual order is not deterministic.)

For older architectures I don't know, but you can verify it by using the built-in clock() function.

For example, I used the following OpenCL kernel code (you can easily convert it to CUDA):

   __kernel void test(__global uint* start, __global uint* end, __global float* buffer)
   {
       int id = get_global_id(0);
       start[id] = clock();
       // __do_something_here
       end[id] = clock();
   }

Then output it to a file and build a graph; you will see the scheduling order very clearly.
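
If it helps, a rough CUDA conversion of that kernel might look like this (my sketch; note that the device clock() counter is per SM, so timestamps are only directly comparable for threads running on the same SM):

// Rough CUDA equivalent of the OpenCL kernel above.
__global__ void test(unsigned int *start, unsigned int *end, float *buffer) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    start[id] = (unsigned int)clock();
    // __do_something_here
    end[id] = (unsigned int)clock();
}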

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow