Question

I am calling CUDA kernels from MATLAB.

I was previously told (in David Kirk's book) that one could only use 16 KB of shared memory per thread, but I am able to consume far more than that:

__global__ void plain(float* arg)
{
    // 12000 floats * 4 bytes = 48000 bytes of shared memory,
    // well over the supposed 16 KB limit
    __shared__ float array[12000];
    int k;

    for (k = 0; k < 12000; k++)
    {
        array[k] = 1;
    }
}

CUDA C reports that a float is 4 bytes, so the total array size is 48 KB, which is greater than 16 KB. It runs fine, so how can this be?

I was also told, in "GPU shared memory size is very small - what can I do about it?", that the max shared memory per block is what matters. The max shared memory per block for my card is 49152 bytes, yet I am able to run the above code with 1000 threads per block.

It seems like that would use 48 KB per block, which can't be right. Is it that the SM only services one block at a time, and in doing so preserves the condition that there can only be 48 KB per thread block?

How is 48 KB of shared memory per block reconciled with 16 KB of shared memory per thread?

Thanks


Solution

Shared memory is allocated per thread block, with as much as 48 KB available per SM with compute capability 2.0 and up. So on a given SM you could be running a single thread block that consumes the entire 48 KB or, say, three thread blocks each of which allocates 16 KB. The limit of 16 KB of shared memory per SM applies to compute capabilities < 2.0. As opposed to shared memory, which is allocated per thread block, local memory ("local" meaning "thread local") is allocated per thread.
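
A quick way to check the actual per-block limit on your card is to query the device properties from the CUDA runtime API. A minimal sketch, assuming device 0 and omitting error checking:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumes device 0

    // sharedMemPerBlock is the limit a kernel launch is checked against;
    // for compute capability 2.0+ this is typically 49152 bytes (48 KB)
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    return 0;
}

If a kernel's static __shared__ allocation exceeds sharedMemPerBlock, the launch fails; below that limit it is legal regardless of how many threads the block contains, because the allocation is made once per block, not once per thread.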

OTHER TIPS

Threads don't have shared memory. Your code uses "block" shared memory (there is no other shared memory in CUDA).
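
To see that there is one copy of the array per block rather than one per thread, here is a small sketch; the kernel and array names are made up for illustration:

__global__ void blockShared(float* out)
{
    // One copy of this array exists per thread block and is visible to
    // all threads of that block: 256 floats = 1 KB per block in total,
    // not 1 KB per thread.
    __shared__ float array[256];

    int tid = threadIdx.x;

    // Each thread writes one element...
    array[tid] = (float)tid;

    // ...and after the barrier every thread can read elements written by the others.
    __syncthreads();

    out[blockIdx.x * blockDim.x + tid] = array[blockDim.x - 1 - tid];
}

Launched with 256 threads per block, all 256 threads share the same 1 KB array; if shared memory were allocated per thread, the same block would need 256 KB, which no SM provides.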

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow