CUDA: atomicAdd takes too much time, serializing threads

https://stackoverflow.com/questions/11597802

22-06-2021

Question

I have a kernel that performs some comparisons and decides whether two objects collide. I want to store the IDs of the colliding objects in an output buffer without any gaps: each collision should be recorded at a unique, consecutive index in the output buffer.

So I created an atomic counter in shared memory (the local sum) and another in global memory (the global sum). The code below shows the shared counter being incremented when a collision is found. Incrementing the atomic counter in global memory is not a problem for now.

__global__ void mykernel(..., unsigned int *gColCnt) {
    ...

    __shared__ unsigned int sColCnt;
    __shared__ unsigned int sIndex;

    if (threadIdx.x == 0) {
        sColCnt = 0;
    }

    __syncthreads();

    unsigned int index = 0;
    if (colliding)
        index = atomicAdd(&sColCnt, 1); //!!Time Consuming!!

    __syncthreads();

    if (threadIdx.x == 0)
        sIndex = atomicAdd(gColCnt, sColCnt);

    __syncthreads();

    if (sColCnt + sIndex > outputSize) { //output buffer is not enough
        //printf("Exceeds outputsize: %d + %d > %d\n", sColCnt, sIndex, outputSize);
        return;
    }

    if (colliding) {
        output[sIndex + index] = make_uint2(startId, toId);
    }
}

My problem is that when many threads try to increment the atomic variable, they get serialized. Before writing something like a prefix sum, I wanted to ask whether there is a way to do this efficiently.

The elapsed time of my kernel increases from 13 ms to 44 ms because of this one line.

I found some prefix-sum example code, but the links it references fail because NVIDIA's discussion board is down: https://stackoverflow.com/a/3836944/596547

Edit: I have added the end of my code above as well. In fact, I do have a hierarchy. To see the effect of every line of code, I set up two extreme scenes: one where every object collides with every other object, and one where almost no objects collide.

At the end, I add the shared atomic counter to a global variable (gColCnt) to make the number of collisions available outside the kernel and to find the correct index values. I think I have to use atomicAdd here either way.

Solution

Consider using a parallel stream compaction algorithm, for instance thrust::copy_if.
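As a rough illustration of that suggestion, here is a minimal sketch of stream compaction with thrust::copy_if. It assumes the collision test has already produced one candidate pair and one flag per test; the names compactPairs, IsSet, pairs and flags are hypothetical and do not come from the question. The stencil overload of copy_if copies only the pairs whose flag is nonzero, packs them contiguously, and keeps their relative order.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <vector>

// Predicate applied to the stencil (flag) values: keep a pair if its flag is set.
struct IsSet {
    __host__ __device__ bool operator()(unsigned char flag) const {
        return flag != 0;
    }
};

// Hypothetical helper: returns the flagged pairs without gaps, in input order.
std::vector<uint2> compactPairs(const std::vector<uint2> &pairs,
                                const std::vector<unsigned char> &flags) {
    // Copy candidates and their collision flags to the device.
    thrust::device_vector<uint2> dPairs(pairs.begin(), pairs.end());
    thrust::device_vector<unsigned char> dFlags(flags.begin(), flags.end());
    thrust::device_vector<uint2> dOut(pairs.size());

    // Stencil form of copy_if: dPairs[i] is copied when IsSet(dFlags[i]) is true.
    auto outEnd = thrust::copy_if(dPairs.begin(), dPairs.end(),
                                  dFlags.begin(), dOut.begin(), IsSet());

    // outEnd - dOut.begin() is the number of collisions found.
    std::vector<uint2> result(outEnd - dOut.begin());
    thrust::copy(dOut.begin(), outEnd, result.begin());
    return result;
}
```

In this arrangement the kernel would only write the flags (and pairs), and the compaction pass replaces the per-thread atomicAdd. Whether that is faster overall depends on the data and would need to be measured.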

Other suggestions

Related NVIDIA blog article: http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
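A related technique, not taken from the linked article, is often called warp-aggregated atomics: the threads of a warp count their hits with a ballot, one lane performs a single atomicAdd for the whole warp, and each hit derives its own slot from its rank within the warp. The sketch below is only an illustration under stated assumptions: a 1D block whose size is a multiple of 32, CUDA 9 or later for the *_sync intrinsics, and hypothetical names (compactFlagged, compactOnDevice, flags, counter). It stores the index of each flagged element rather than an ID pair.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Writes the index of every flagged element to a unique, gap-free slot in out.
// Assumes a 1D block with blockDim.x a multiple of 32, so every warp is full.
__global__ void compactFlagged(const unsigned char *flags, unsigned int n,
                               unsigned int *out, unsigned int *counter) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads stay in the warp but never write.
    bool pred = (i < n) && (flags[i] != 0);

    unsigned int lane = threadIdx.x & 31u;
    unsigned int ballot = __ballot_sync(0xffffffffu, pred);   // bit k set if lane k has a hit
    unsigned int rank = __popc(ballot & ((1u << lane) - 1u)); // hits in lower lanes

    unsigned int base = 0;
    if (lane == 0 && ballot != 0)
        base = atomicAdd(counter, (unsigned int)__popc(ballot)); // one atomic per warp
    base = __shfl_sync(0xffffffffu, base, 0);                    // broadcast from lane 0

    if (pred)
        out[base + rank] = i;
}

// Hypothetical host wrapper: returns the indices of flagged elements.
// The order across warps is not deterministic.
std::vector<unsigned int> compactOnDevice(const std::vector<unsigned char> &flags) {
    unsigned int n = (unsigned int)flags.size();
    if (n == 0) return {};

    unsigned char *dFlags = nullptr;
    unsigned int *dOut = nullptr, *dCounter = nullptr;
    cudaMalloc(&dFlags, n);
    cudaMalloc(&dOut, n * sizeof(unsigned int));
    cudaMalloc(&dCounter, sizeof(unsigned int));
    cudaMemcpy(dFlags, flags.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    const unsigned int block = 128; // multiple of 32, as assumed by the kernel
    compactFlagged<<<(n + block - 1) / block, block>>>(dFlags, n, dOut, dCounter);

    unsigned int count = 0;
    cudaMemcpy(&count, dCounter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    std::vector<unsigned int> result(count);
    cudaMemcpy(result.data(), dOut, count * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(dFlags);
    cudaFree(dOut);
    cudaFree(dCounter);
    return result;
}
```

The same ballot-and-rank step could also feed a shared-memory counter per block, as in the question's two-level scheme, so that only one global atomicAdd is issued per block. Error checking is omitted for brevity.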

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow