Question

I am designing a CUDA kernel that will be launched with 16 threads per thread block. I have an array of N ints in shared memory (i.e. per thread block) that I wish to process.

If the threads access consecutive elements of the array, does that mean there will be no bank conflicts? I understand that if the array were a char array there would be bank conflicts, but I'm not entirely sure what happens if it's an int array. I'm guessing there will be bank conflicts because each set of 4 consecutive ints shares the same memory bank?

If this is true then what is the correct solution to prevent bank conflicts? Address scrambling like in the histogram sample?

Solution

For devices of compute capability >= 2.0, shared memory is divided into 32 banks, with successive 32-bit words assigned to successive banks. So, if each thread in a warp (a warp is 32 threads) addresses consecutive 32-bit words, as happens with consecutive elements of an int array, there won't be any bank conflicts. Also, different threads can access the same 32-bit word without causing any bank conflicts. This means that there also won't be any bank conflicts if all threads read consecutive values from an array of chars: four consecutive chars share one 32-bit word, so the threads that hit the same bank are all reading the same word.

Bank conflicts are really only caused by two or more threads in a warp addressing different 32-bit words that fall in the same bank, that is, words whose indices are a multiple of 32 apart (byte addresses a multiple of 128 apart).
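To make the rule concrete, here is a rough host-side sketch in plain C++ (not device code) that models the mapping described above. It assumes 32 banks that are each 4 bytes wide, as on compute capability 2.x. The names `bank_of`, `conflict_degree` and `strided_offsets` are made up for illustration and are not part of the CUDA API. The model ignores everything except the word-to-bank mapping.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Assumed hardware parameters (compute capability 2.x): 32 banks, 4 bytes wide.
constexpr std::size_t kNumBanks = 32;
constexpr std::size_t kBankWidthBytes = 4;

// Bank that holds the given byte offset into shared memory.
std::size_t bank_of(std::size_t byte_offset) {
    return (byte_offset / kBankWidthBytes) % kNumBanks;
}

// Largest number of *distinct* 32-bit words requested from any single bank.
// 1 means conflict-free; k means a k-way bank conflict.
// Several threads touching the same word count once (broadcast).
std::size_t conflict_degree(const std::vector<std::size_t>& byte_offsets) {
    std::map<std::size_t, std::set<std::size_t>> words_per_bank;
    for (std::size_t offset : byte_offsets) {
        std::size_t word = offset / kBankWidthBytes;
        words_per_bank[word % kNumBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& entry : words_per_bank) {
        if (entry.second.size() > worst) worst = entry.second.size();
    }
    return worst;
}

// Byte offsets touched when thread t accesses element t * stride
// of an array whose elements are elem_size bytes each.
std::vector<std::size_t> strided_offsets(std::size_t threads,
                                         std::size_t stride,
                                         std::size_t elem_size) {
    std::vector<std::size_t> offsets;
    for (std::size_t t = 0; t < threads; ++t) {
        offsets.push_back(t * stride * elem_size);
    }
    return offsets;
}
```

Under this model, `conflict_degree(strided_offsets(32, 1, sizeof(int)))` is 1 (consecutive ints, no conflict), consecutive chars also give 1, a stride of 2 ints gives a 2-way conflict, and a stride of 32 ints gives a 32-way conflict because every thread lands in bank 0 with a different word.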

The answer to this may be different for other compute capabilities -- I haven't checked.

Note that 16 threads per block is very low. With a block size this low, I don't think you will be able to improve performance on the GPU vs. the CPU (unless this is only a small part of the total workload and the data is already in GPU memory).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow