If each piece of data takes 128 bits or more, is there any advantage to grouping them in memory?

https://stackoverflow.com/questions/18093024

23-06-2022

Question

I've read in the CUDA Programming Guide that global memory in a CUDA device is accessed in transactions of 32, 64 or 128 bits. Knowing that, is there any advantage to, say, having a set of float4 values (128 bits each) close together in memory? As I understand it, whether the float4 values are scattered in memory or stored in sequence, the number of transactions will be the same. Or will all the accesses be coalesced into one gigantic transaction?

Solution

Coalescing refers to combining memory requests from individual threads in a warp into a single memory transaction.

Note that the transaction sizes given in the CUDA Programming Guide (32, 64 or 128) are in bytes, not bits. A single memory transaction is typically a 128-byte cache line, so it can hold eight 128-bit (e.g. float4) quantities.

So, yes, there is a benefit to having multiple threads request adjacent 128-bit quantities, because those requests can still be coalesced into a single (128-byte) cache line request to memory.
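To make the arithmetic concrete, here is a minimal host-side sketch in C++ that counts how many 128-byte lines one warp would touch for a given access stride. It is only a model, not a measurement of real hardware: it assumes a warp of 32 threads, a base address aligned to 128 bytes, and 16-byte elements (the size of a float4). The names `linesTouchedByWarp`, `kLineBytes`, `kElemBytes` and `kWarpSize` are hypothetical and chosen for illustration.

```cpp
#include <cstddef>
#include <set>

// Assumed sizes for this model (not queried from any device).
constexpr std::size_t kLineBytes = 128;  // one cache line / memory transaction
constexpr std::size_t kElemBytes = 16;   // one 128-bit element, e.g. a float4
constexpr std::size_t kWarpSize  = 32;   // threads per warp

// Hypothetical helper: thread t of the warp reads the element at index
// t * strideElems, starting from a 128-byte-aligned base address.
// Returns the number of distinct 128-byte lines those 32 reads fall into.
std::size_t linesTouchedByWarp(std::size_t strideElems) {
    std::set<std::size_t> lines;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t byteOffset = t * strideElems * kElemBytes;
        lines.insert(byteOffset / kLineBytes);  // index of the line holding this element
    }
    return lines.size();
}
```

Under these assumptions, `linesTouchedByWarp(1)` (adjacent elements) gives 4 lines, since 32 threads times 16 bytes is 512 bytes. `linesTouchedByWarp(8)` (one element per line) gives 32 lines, so the scattered layout needs eight times as many line requests for the same amount of useful data.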

Licensed under: CC-BY-SA with attribution
