Memory coalescing in global writes

Question 1

Coalesced writes (or lack thereof) can affect performance, just as coalesced reads (or lack thereof) can.

A coalesced read occurs when a read request triggered by a warp instruction, e.g.:

int i = my_int_data[threadIdx.x+blockDim.x*blockIdx.x];

can be satisified by a single read transaction in the memory controller (which is essentially saying all the individual thread reads are coming from a single cache line.)

A coalesced write occurs when a write request triggered by a warp instruction, e.g.:

my_int_data[threadIdx.x+blockDim.x*blockIdx.x] = i;

can be satisfied by a single write transaction in the memory controller.

For the above examples I have shown, there are no differences generationally.

But there are other types of reads or writes that could coalesce (i.e. collapse to a single memory controller transaction) in later devices but not in earlier devices. One example is a "broadcast read":

int i = my_int_data[0];

In the above example, all threads read from the same global location. In newer devices, such a read would be "broadcast" to all threads in a single transaction. In some earlier devices, this would result in a serialized servicing of threads. Such an example probably has no corollary in writes, because multiple threads writing to a single location gives undefined behavior. However a "scrambled" write may coalesce on newer devices but not older:

my_int_data[(threadIdx.x+5)%32] = i;

Note that all the writes above are unique (within the warp) and belonging to an individual cache line, but they do not satisfy the coalescing requirements on 1.0 or 1.1 devices, but should on newer devices.

If you read the global memory access description for devices of cc 1.0 and 1.1, and compare to later devices, you will see some of the requirements for coalescing on earlier devices that have been relaxed on later devices.

Question 2

We did this experiment in a course I conducted. Coalescing turned out to be moderately more important in writes than in reads perhaps because the L1 and L2 caches store some of the unused data for later use.