The caution is intended for people who are using mapped pinned memory to coordinate execution between the CPU and GPU, or between multiple GPUs. When I wrote that, I did not expect anyone to use such a mechanism in the single-GPU case because CUDA provides so many other, better ways to coordinate execution between the CPU(s) and a single GPU.
If there is strictly a producer/consumer relationship between the CPU and GPU (i.e. the producer is updating the memory location and the consumer is passively reading it), that can be expected to work under certain circumstances.
If the GPU is the producer, the CPU would see updates to the memory location as they get posted out of the GPU’s L2 cache. But the GPU code may have to execute memory barriers (e.g., __threadfence_system()) to force that to happen; and even if that code works on x86, it would likely break on ARM without heroic measures, because ARM does not snoop bus traffic.
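As a concrete (and, per the above, not recommended) illustration of the GPU-producer case, here is a minimal sketch of the handshake on mapped pinned memory. It assumes an x86 host and a mapped-pinned-capable device; the names (`produce`, `h_mem`) and the data/flag layout are illustrative, not an established pattern:

```cuda
// Sketch only: GPU produces a value in mapped pinned memory, CPU consumes it.
// Assumes x86 host semantics; per the discussion above, this is fragile on ARM.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void produce(volatile int *data, volatile int *flag)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *data = 42;              // write the payload first
        __threadfence_system();  // fence so the payload is visible to the host
        *flag = 1;               // only then publish the flag
    }
}

int main()
{
    int *h_mem;  // mapped pinned allocation: [0] = data, [1] = flag
    cudaHostAlloc(&h_mem, 2 * sizeof(int), cudaHostAllocMapped);
    h_mem[0] = 0;
    h_mem[1] = 0;

    int *d_mem;
    cudaHostGetDevicePointer(&d_mem, h_mem, 0);
    produce<<<1, 1>>>(d_mem, d_mem + 1);

    // CPU spins until the GPU publishes the flag.
    while (((volatile int *)h_mem)[1] == 0)
        ;
    printf("data = %d\n", h_mem[0]);

    cudaDeviceSynchronize();
    cudaFreeHost(h_mem);
    return 0;
}
```

Without the __threadfence_system() call, the flag could become visible to the CPU before the payload does.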
If the CPU is the producer, the GPU would have to bypass its L2 cache when reading the location, because that cache is not coherent with CPU memory.
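The CPU-producer direction might be sketched as follows. The kernel reads through a volatile pointer so each load actually goes out to the mapped allocation rather than being cached or hoisted; again, this is an illustrative sketch with hypothetical names (`consume`, `h_mem`), not a recommended design:

```cuda
// Sketch only: CPU produces a value in mapped pinned memory, GPU consumes it.
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>

__global__ void consume(volatile int *data, volatile int *flag, int *out)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        while (*flag == 0)  // volatile: re-read host memory on every iteration
            ;
        *out = *data;
    }
}

int main()
{
    int *h_mem;  // mapped pinned allocation: [0] = data, [1] = flag
    cudaHostAlloc(&h_mem, 2 * sizeof(int), cudaHostAllocMapped);
    h_mem[0] = 0;
    h_mem[1] = 0;

    int *d_out;
    cudaMalloc(&d_out, sizeof(int));

    int *d_mem;
    cudaHostGetDevicePointer(&d_mem, h_mem, 0);
    consume<<<1, 1>>>(d_mem, d_mem + 1, d_out);  // kernel spins on the flag

    h_mem[0] = 42;                                      // produce the payload
    std::atomic_thread_fence(std::memory_order_release);
    h_mem[1] = 1;                                       // then publish the flag

    cudaDeviceSynchronize();  // kernel exits once it observes the flag
    int result;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("data = %d\n", result);

    cudaFree(d_out);
    cudaFreeHost(h_mem);
    return 0;
}
```

Note that a kernel spin-waiting on host writes can deadlock if the launch and the host stores are not ordered as shown.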
If the CPU and GPU are trying to update the same memory location concurrently, there is no mechanism to ensure atomicity between the two. Doing CPU atomics will ensure that the update is atomic with respect to CPU code, and doing GPU atomics will ensure that the update is atomic with respect to the GPU that is doing the update; but neither is atomic with respect to the other processor.
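To make the hazard concrete, here is a sketch (with illustrative names, and assuming a device that accepts atomics on mapped pinned memory at all) in which each side increments a shared counter "atomically" on its own side. Because the two atomicity domains do not compose, increments can still be lost:

```cuda
// Sketch only: each side is atomic within itself, but the CPU's read-modify-write
// and the GPU's atomicAdd can still interleave non-atomically with each other,
// so the final count may come up short.
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>

__global__ void gpuIncrement(int *counter, int n)
{
    for (int i = 0; i < n; ++i)
        atomicAdd(counter, 1);  // atomic only among GPU threads
}

int main()
{
    int *counter;  // mapped pinned allocation shared by both processors
    cudaHostAlloc(&counter, sizeof(int), cudaHostAllocMapped);
    *counter = 0;

    int *d_counter;
    cudaHostGetDevicePointer(&d_counter, counter, 0);
    gpuIncrement<<<1, 1>>>(d_counter, 10000);

    // Atomic only among CPU threads; not atomic with respect to the GPU.
    auto *cpuCounter = reinterpret_cast<std::atomic<int> *>(counter);
    for (int i = 0; i < 10000; ++i)
        cpuCounter->fetch_add(1);

    cudaDeviceSynchronize();
    printf("counter = %d (may be less than 20000)\n", *counter);

    cudaFreeHost(counter);
    return 0;
}
```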
All of the foregoing discussion assumes there is only one GPU; if multiple GPUs are involved, all bets are off. Although atomics are provided for in the PCI Express 3.0 bus specification, I don’t believe they are supported by NVIDIA GPUs, and support in the underlying platform is not guaranteed either.
It seems to me that whatever a developer may be trying to accomplish by doing atomics on mapped pinned memory, there’s probably a method that is faster, more likely to work, or both.