The caution is intended for people who are using mapped pinned memory to coordinate execution between the CPU and GPU, or between multiple GPUs. When I wrote that, I did not expect anyone to use such a mechanism in the single-GPU case because CUDA provides so many other, better ways to coordinate execution between the CPU(s) and a single GPU.
If there is strictly a producer/consumer relationship between the CPU and GPU (i.e. the producer is updating the memory location and the consumer is passively reading it), that can be expected to work under certain circumstances.
If the GPU is the producer, the CPU would see updates to the memory location as they get posted out of the GPU’s L2 cache. But the GPU code may have to execute memory barriers (e.g., __threadfence_system()) to force that to happen; and even if that code works on x86, it would likely break on ARM without heroic measures, because ARM does not snoop bus traffic.
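As a concrete (and, per the above, not recommended) illustration of the GPU-producer case, here is a minimal sketch of the handshake on mapped pinned memory. It assumes an x86 host and a mapped-pinned-capable device; the names (`produce`, `h_mem`) and the data/flag layout are illustrative, not an established pattern:

```cuda
// Sketch only: GPU produces a value in mapped pinned memory, CPU consumes it.
// Assumes x86 host semantics; per the discussion above, this is fragile on ARM.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void produce(volatile int *data, volatile int *flag)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *data = 42;              // write the payload first
        __threadfence_system();  // fence so the payload is visible to the host
        *flag = 1;               // only then publish the flag
    }
}

int main()
{
    int *h_mem;  // mapped pinned allocation: [0] = data, [1] = flag
    cudaHostAlloc(&h_mem, 2 * sizeof(int), cudaHostAllocMapped);
    h_mem[0] = 0;
    h_mem[1] = 0;

    int *d_mem;
    cudaHostGetDevicePointer(&d_mem, h_mem, 0);
    produce<<<1, 1>>>(d_mem, d_mem + 1);

    // CPU spins until the GPU publishes the flag.
    while (((volatile int *)h_mem)[1] == 0)
        ;
    printf("data = %d\n", h_mem[0]);

    cudaDeviceSynchronize();
    cudaFreeHost(h_mem);
    return 0;
}
```

Without the __threadfence_system() call, the flag could become visible to the CPU before the payload does.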
If the CPU is the producer, the GPU would have to bypass its L2 cache when reading the location, because that cache is not coherent with CPU memory.
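The CPU-producer direction might be sketched as follows. The kernel reads through a volatile pointer so each load actually goes out to the mapped allocation rather than being cached or hoisted; again, this is an illustrative sketch with hypothetical names (`consume`, `h_mem`), not a recommended design:

```cuda
// Sketch only: CPU produces a value in mapped pinned memory, GPU consumes it.
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>

__global__ void consume(volatile int *data, volatile int *flag, int *out)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        while (*flag == 0)  // volatile: re-read host memory on every iteration
            ;
        *out = *data;
    }
}

int main()
{
    int *h_mem;  // mapped pinned allocation: [0] = data, [1] = flag
    cudaHostAlloc(&h_mem, 2 * sizeof(int), cudaHostAllocMapped);
    h_mem[0] = 0;
    h_mem[1] = 0;

    int *d_out;
    cudaMalloc(&d_out, sizeof(int));

    int *d_mem;
    cudaHostGetDevicePointer(&d_mem, h_mem, 0);
    consume<<<1, 1>>>(d_mem, d_mem + 1, d_out);  // kernel spins on the flag

    h_mem[0] = 42;                                      // produce the payload
    std::atomic_thread_fence(std::memory_order_release);
    h_mem[1] = 1;                                       // then publish the flag

    cudaDeviceSynchronize();  // kernel exits once it observes the flag
    int result;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("data = %d\n", result);

    cudaFree(d_out);
    cudaFreeHost(h_mem);
    return 0;
}
```

Note that a kernel spin-waiting on host writes can deadlock if the launch and the host stores are not ordered as shown.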
If the CPU and GPU are trying to update the same memory location concurrently, there is no mechanism to ensure atomicity between the two. Doing CPU atomics will ensure that the update is atomic with respect to CPU code, and doing GPU atomics will ensure that the update is atomic with respect to the GPU that is doing the update; but neither is atomic with respect to the other processor.
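To make the hazard concrete, here is a sketch (with illustrative names, and assuming a device that accepts atomics on mapped pinned memory at all) in which each side increments a shared counter "atomically" on its own side. Because the two atomicity domains do not compose, increments can still be lost:

```cuda
// Sketch only: each side is atomic within itself, but the CPU's read-modify-write
// and the GPU's atomicAdd can still interleave non-atomically with each other,
// so the final count may come up short.
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>

__global__ void gpuIncrement(int *counter, int n)
{
    for (int i = 0; i < n; ++i)
        atomicAdd(counter, 1);  // atomic only among GPU threads
}

int main()
{
    int *counter;  // mapped pinned allocation shared by both processors
    cudaHostAlloc(&counter, sizeof(int), cudaHostAllocMapped);
    *counter = 0;

    int *d_counter;
    cudaHostGetDevicePointer(&d_counter, counter, 0);
    gpuIncrement<<<1, 1>>>(d_counter, 10000);

    // Atomic only among CPU threads; not atomic with respect to the GPU.
    auto *cpuCounter = reinterpret_cast<std::atomic<int> *>(counter);
    for (int i = 0; i < 10000; ++i)
        cpuCounter->fetch_add(1);

    cudaDeviceSynchronize();
    printf("counter = %d (may be less than 20000)\n", *counter);

    cudaFreeHost(counter);
    return 0;
}
```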
All of the foregoing discussion assumes there is only one GPU; if multiple GPUs are involved, all bets are off. Although atomics are provided for in the PCI Express 3.0 bus specification, I don’t believe they are supported by NVIDIA GPUs, and support in the underlying platform is not guaranteed either.
It seems to me that whatever a developer may be trying to accomplish by doing atomics on mapped pinned memory, there’s probably a method that is faster, more likely to work, or both.