Question

I am posting a stripped-down version of my code for review. I believe it should compile and execute without any problems, but since I excluded all the irrelevant parts, I might have made some mistake.

#include <cstring>        // for memset
#include <cuda_runtime.h>

struct Users {
    double A[96];
    double B[32];
    double C[32];
};

This is my Users structure with fixed-length arrays. The main function is given below.

__global__ void calc(Users *users, double *Step, int numUsers);

int main(int argc, char **argv) {

    int numUsers = 10;
    Users *users = new Users[numUsers];
    double Step[96];

    for (int i = 0; i < 32; i++) {
        Step[i]      = 0.8;
        Step[i + 32] = 0.8;
        Step[i + 64] = 0.8;
    }

    for (int usr = 0; usr < numUsers; usr++) {
        for (int i = 0; i < 32; i++) {
            users[usr].A[i]      = 10;
            users[usr].A[i + 32] = 20;
            users[usr].A[i + 64] = 30;
        }
        memset(users[usr].B, 0, sizeof(double) * 32);
        memset(users[usr].C, 0, sizeof(double) * 32);
    }


    double *d_Step;
    cudaMalloc((void**)&d_Step, sizeof(double) * 96);
    cudaMemcpy(d_Step, Step, sizeof(double) * 96, cudaMemcpyHostToDevice);


    Users *deviceUsers;
    cudaMalloc((void**)&deviceUsers, sizeof(Users) * numUsers);
    cudaMemcpy(deviceUsers, users, sizeof(Users) * numUsers, cudaMemcpyHostToDevice);


    dim3 grid;
    dim3 block;

    grid.x = 1;
    grid.y = 1;
    grid.z = 1;
    block.x = 32;
    block.y = 10;
    block.z = 1;
    calc<<<grid, block >>> (deviceUsers, d_Step, numUsers);

    cudaDeviceSynchronize();
    cudaFree(d_Step);
    cudaFree(deviceUsers);
    delete[] users;
    return 0;
}

Please note that Step is a 1D array with 96 bins, and that I am launching 10 warps (32 threads in the x direction, and 10 rows of them in my block). Each warp accesses the same Step array, as can be seen in the kernel below.

__global__ void calc(Users *users, double *Step, int numUsers) {

    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    int uId = threadIdx.y;

    while (uId < numUsers) {

        double mean00 = users[uId].A[tId]      * Step[tId];
        double mean01 = users[uId].A[tId + 32] * Step[tId + 32];
        double mean02 = users[uId].A[tId + 64] * Step[tId + 64];

        users[uId].A[tId]      = (mean00 == 0? 0 : 1 / mean00);
        users[uId].A[tId + 32] = (mean01 == 0? 0 : 1 / mean01);
        users[uId].A[tId + 64] = (mean02 == 0? 0 : 1 / mean02);

        uId += 10;
    }
}

Now when I use the NVIDIA Visual Profiler, the coalesced retrievals are 47%. I investigated further and found that the Step array, which is accessed by every warp, causes this problem. If I replace it with some constant, the accesses are 100% coalesced.

Q1) As I understand it, coalesced access is tied to memory segments, i.e., the bytes accessed by a warp have to fall in aligned segments whose size is a multiple of 32 bytes, regardless of whether the elements are integers or doubles. Why am I not getting coalesced accesses?

To my knowledge, whenever CUDA allocates a memory block in device global memory, it assigns it an aligned starting address. Thus, as long as a warp accesses the starting point plus the next 32 locations, the access should be coalesced. Am I correct?

Hardware

Geforce GTX 470, Compute Capability 2.0


Solution

Your kernel reads Step from global memory 10 times, once per iteration of the `uId` loop. Although the L1 cache can reduce the actual traffic to global memory, the profiler still treats this as an inefficient access pattern.

The profiler reports this metric as 'global load efficiency'; it does not actually say whether the accesses are coalesced or not.
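One way to avoid re-reading Step from global memory on every iteration is to stage it in shared memory once per block. A sketch (my own untested variation on the kernel above, assuming a single block in x as in the launch shown) could look like this; declaring Step as `__constant__` memory would be another option, since every thread in a row reads it with the same pattern:

```cuda
__global__ void calc(Users *users, double *Step, int numUsers) {

    // Stage the 96-entry Step array in shared memory once per block;
    // one row of 32 threads performs the three coalesced loads.
    __shared__ double sStep[96];
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    if (threadIdx.y == 0) {
        sStep[tId]      = Step[tId];
        sStep[tId + 32] = Step[tId + 32];
        sStep[tId + 64] = Step[tId + 64];
    }
    __syncthreads();

    int uId = threadIdx.y;
    while (uId < numUsers) {

        // Same computation as before, but Step now comes from shared memory.
        double mean00 = users[uId].A[tId]      * sStep[tId];
        double mean01 = users[uId].A[tId + 32] * sStep[tId + 32];
        double mean02 = users[uId].A[tId + 64] * sStep[tId + 64];

        users[uId].A[tId]      = (mean00 == 0 ? 0 : 1 / mean00);
        users[uId].A[tId + 32] = (mean01 == 0 ? 0 : 1 / mean01);
        users[uId].A[tId + 64] = (mean02 == 0 ? 0 : 1 / mean02);

        uId += blockDim.y;
    }
}
```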

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow