Question

First off, I'm not sure how to word my question, so I think it's better to ask it with an example from NVIDIA's CUDA C Programming Guide.

In section 3.2.3 (Shared Memory), the following code is given for matrix multiplication using shared memory (I hope it's okay for me to copy it here).

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

Regarding the statement Matrix Csub = GetSubMatrix(C, blockRow, blockCol): will every thread execute it? Wouldn't that defeat the whole point of using shared memory to reduce the number of global memory accesses? I feel like I'm missing something fundamental here.

Also, there's certainly a better way to word this question. I just don't know how!

Thanks,

Zakiir


Solution

Each thread executes the same instruction at the same time (or is idle), so yes, every thread goes into GetSubMatrix. The key is that the copying work is divided among the threads: each thread handles only a few elements. If there are N threads and 3N elements to copy, each thread copies 3.
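It may also help to look at what GetSubMatrix actually does. In the same guide section it is defined as pure pointer arithmetic: it never reads or writes a matrix element, so every thread executing it generates no global memory traffic at all. Here is a host-checkable sketch of that helper (the __device__ qualifier dropped so it compiles as plain C):

```c
#include <stddef.h>

#define BLOCK_SIZE 16

// Matrix type from the guide: row-major storage with an explicit stride.
typedef struct {
    int width;
    int height;
    int stride;
    float *elements;
} Matrix;

// Locate the BLOCK_SIZE x BLOCK_SIZE sub-matrix of A at block (row, col).
// Note: only pointer arithmetic here -- no element of A is touched.
Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}
```

So all threads do execute this call, but the actual global memory reads happen later, in the GetElement lines, where each thread loads exactly one element of Asub and one of Bsub into shared memory.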

For example, if I were copying a vector, I might do the following (the ??? values are whatever your kernel has at hand):

float *from = ???;   // source pointer
float *to   = ???;   // destination pointer
int    num  = ???;   // number of elements to copy
// A linear index for this thread within the block
int thread = threadIdx.x
           + threadIdx.y * blockDim.x
           + threadIdx.z * blockDim.x * blockDim.y;
int num_threads = blockDim.x * blockDim.y * blockDim.z;
for (int i = thread; i < num; i += num_threads) {
    to[i] = from[i];
}

Every thread is involved, each copying one element at a time. As an aside: if you can arrange for the threads to copy a sequential run of elements, as in the loop above, the accesses are coalesced and you get bonus speed in the copy.
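To convince yourself that the strided loop above covers every element exactly once, you can simulate the pattern on the host by running the per-thread loop sequentially for each simulated thread index (the thread count and sizes here are made up for illustration):

```c
#define NUM_THREADS 8   // simulated number of threads in the block

// The work one thread with linear index `thread` would do:
// copy elements thread, thread + NUM_THREADS, thread + 2*NUM_THREADS, ...
void copy_strided(const float *from, float *to, int num, int thread)
{
    for (int i = thread; i < num; i += NUM_THREADS)
        to[i] = from[i];
}

// Run the same loop for every simulated thread index in turn.
void copy_all(const float *from, float *to, int num)
{
    for (int t = 0; t < NUM_THREADS; ++t)
        copy_strided(from, to, num, t);
}
```

With num = 20 and 8 simulated threads, threads 0-3 each copy 3 elements and threads 4-7 each copy 2: every element is copied exactly once and none is skipped, which is the whole point of the stride.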

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow