CUDA - selective memory store

https://stackoverflow.com/questions/12082206

27-06-2021

Question

In my kernel, if a condition is met, I update an item of the output buffer:

if (condition(input[i])) //?
    output[i] = 1;

Otherwise the output stays unchanged, keeping its value of 0.

The density of updates is quite unpredictable and depends on the input. Furthermore, which output locations will be updated is not known in advance (although I can force them in some cases).

My question is: is it better to write all items, to achieve coalescing, or to do a selective write?

output[i] = condition(input[i]); //?

Please explain the reasoning behind your answer.

Solution

Coalescing is achieved even if some threads in the warp do not participate in the load or store, as long as all participating threads satisfy the requirements of coalescing. So conditional writes should have no effect on memory throughput.

However, a conditional write may require additional instructions because of the branch (this would probably explain, for example, the difference in performance measured by Eugene in his answer).
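As a minimal sketch of the two variants discussed here, assuming a one-dimensional launch and a hypothetical predicate (the input value is odd) standing in for the question's condition(): the names selectiveWrite, unconditionalWrite and variantsAgree are illustrative and do not come from the original post. Note that the selective variant only matches the unconditional one if the output buffer was zeroed beforehand, for example with cudaMemset.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical predicate standing in for the question's condition().
__device__ bool condition(unsigned int x) { return (x & 1u) != 0; }

// Selective store: only threads whose predicate holds write.
// Relies on output having been zeroed before the launch.
__global__ void selectiveWrite(const unsigned int* input, unsigned int* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && condition(input[i])) {
        output[i] = 1u;
    }
}

// Unconditional store: every in-range thread writes either 0 or 1.
__global__ void unconditionalWrite(const unsigned int* input, unsigned int* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = condition(input[i]) ? 1u : 0u;
    }
}

// Host helper (assumes n > 0): runs both kernels on the same input and
// returns true only if every CUDA call succeeded and the outputs are identical.
bool variantsAgree(int n) {
    std::vector<unsigned int> h_in(n), h_a(n, 2u), h_b(n, 3u);
    for (int i = 0; i < n; ++i) h_in[i] = 3u * i + 1u;  // arbitrary test data

    const size_t bytes = n * sizeof(unsigned int);
    unsigned int *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_a, bytes) == cudaSuccess
           && cudaMalloc(&d_b, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(d_a, 0, bytes) == cudaSuccess;  // zero only the selective output

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        selectiveWrite<<<blocks, threads>>>(d_in, d_a, n);
        unconditionalWrite<<<blocks, threads>>>(d_in, d_b, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    return ok && h_a == h_b;
}
```

In both kernels, consecutive threads address consecutive 32-bit words, so the threads that do participate in the store still meet the coalescing requirements. The remaining difference is the branch around the store in the selective version.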

Other tips

On my setup, the kernel that does the conditional set (option 1) runs in 1.727 µs, and the unconditional one (option 2) in 1.399 µs. This is my code (setConditional is the faster one):

__global__ void conditionalSet(unsigned int* array) {
    if ((threadIdx.x & 3) == 0) {
        array[threadIdx.x] = 1;
    }
}

__global__ void setConditional(unsigned int* array) {
    array[threadIdx.x] = (threadIdx.x & 3) == 0 ? 1 : 0;
}
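The answer does not show how the timings were taken. One way to attempt a similar measurement is with CUDA events, sketched below. The launch configuration (one block of 256 threads), the repetition count, and the helper name averageLaunchTimeUs are assumptions for illustration, not part of the original answer.

```cuda
#include <cuda_runtime.h>

// The two kernels from the answer, repeated so the sketch is self-contained.
__global__ void conditionalSet(unsigned int* array) {
    if ((threadIdx.x & 3) == 0) {
        array[threadIdx.x] = 1;
    }
}

__global__ void setConditional(unsigned int* array) {
    array[threadIdx.x] = (threadIdx.x & 3) == 0 ? 1 : 0;
}

using Kernel = void (*)(unsigned int*);

// Returns the average time per launch in microseconds (reps > 0),
// or a negative value if a CUDA call failed.
float averageLaunchTimeUs(Kernel kernel, int reps) {
    const int threads = 256;  // assumed block size; the answer does not state one
    unsigned int* d_array = nullptr;
    if (cudaMalloc(&d_array, threads * sizeof(unsigned int)) != cudaSuccess) {
        return -1.0f;
    }
    cudaMemset(d_array, 0, threads * sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<1, threads>>>(d_array);  // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        kernel<<<1, threads>>>(d_array);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    const bool ok = cudaGetLastError() == cudaSuccess;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_array);
    return ok ? ms * 1000.0f / reps : -1.0f;
}
```

Calling averageLaunchTimeUs(conditionalSet, 1000) and averageLaunchTimeUs(setConditional, 1000) gives one number per kernel. For kernels this small, event timing also includes launch overhead, so the absolute values may differ from those quoted above; a profiler that reports per-kernel durations is likely a better fit for comparing the two.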

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow