Question

In the CUDA Programming Guide it is stated that atomic operations on mapped pinned host memory "are not atomic from the point of view of the host or other devices." What I take from this sentence is that if the host memory region is accessed by only one GPU, it is fine to perform atomics on the mapped pinned host memory (even from within multiple simultaneous kernels).

On the other hand, the book The CUDA Handbook by Nicholas Wilt states on page 128 that:

Do not try to use atomics on mapped pinned host memory, either for the host (locked compare-exchange) or the device (atomicAdd()). On the CPU side, the facilities to enforce mutual exclusion for locked operations are not visible to peripherals on the PCI express bus. Conversely, on the GPU side, atomic operations only work on local device memory locations because they are implemented using the GPU's local memory controller.

Is it safe to perform atomics from inside a CUDA kernel on mapped pinned host memory? Can we rely on the PCI-e bus to preserve the atomicity of the atomics' read-modify-write?
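
For reference, here is a minimal sketch of the pattern being asked about (my own illustration, not code from the question): the counter lives in mapped pinned host memory, and a kernel updates it with atomicAdd() through the device pointer.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Every thread atomically increments a counter that lives in mapped
    // pinned *host* memory, accessed through its device pointer.
    __global__ void countHits(unsigned int *counter)
    {
        atomicAdd(counter, 1u);
    }

    int main()
    {
        // Needed on older CUDA versions before any mapped allocation.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        unsigned int *hostCounter = nullptr;
        cudaHostAlloc((void **)&hostCounter, sizeof(unsigned int), cudaHostAllocMapped);
        *hostCounter = 0;

        unsigned int *devCounter = nullptr;
        cudaHostGetDevicePointer((void **)&devCounter, hostCounter, 0);

        countHits<<<64, 256>>>(devCounter);
        cudaDeviceSynchronize();

        // Expect 64 * 256 = 16384 if the atomics were applied correctly.
        printf("counter = %u\n", *hostCounter);

        cudaFreeHost(hostCounter);
        return 0;
    }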

Solution

The caution is intended for people who are using mapped pinned memory to coordinate execution between the CPU and GPU, or between multiple GPUs. When I wrote that, I did not expect anyone to use such a mechanism in the single-GPU case because CUDA provides so many other, better ways to coordinate execution between the CPU(s) and a single GPU.
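
As one illustration of those other mechanisms (my example, not the answer's): a CUDA event lets the CPU wait for a specific point in the GPU's work without either side polling a shared memory location.

    #include <cuda_runtime.h>

    __global__ void produce(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * i;   // placeholder GPU work
    }

    int main()
    {
        const int n = 1 << 20;
        float *dBuf = nullptr;
        cudaMalloc((void **)&dBuf, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaEvent_t done;
        cudaEventCreate(&done);

        produce<<<(n + 255) / 256, 256, 0, stream>>>(dBuf, n);
        cudaEventRecord(done, stream);   // marks "producer finished" in the stream
        cudaEventSynchronize(done);      // CPU blocks until the GPU reaches that point

        // dBuf can now be consumed safely, e.g. copied back with cudaMemcpy.

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
        cudaFree(dBuf);
        return 0;
    }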

If there is strictly a producer/consumer relationship between the CPU and GPU (i.e. the producer is updating the memory location and the consumer is passively reading it), that can be expected to work under certain circumstances.

If the GPU is the producer, the CPU would see updates to the memory location as they get posted out of the GPU’s L2 cache. But the GPU code may have to execute memory barriers to force that to happen; and even if that code works on x86, it’d likely break on ARM without heroic measures because ARM does not snoop bus traffic.
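
A sketch of that GPU-producer case, assuming the mapped allocations shown earlier (the names here are mine): the kernel writes the payload, issues __threadfence_system() so the write is ordered system-wide, and only then raises a flag the CPU is polling.

    #include <cstdio>
    #include <cuda_runtime.h>

    // GPU producer: publish a payload in mapped host memory, fence, raise a flag.
    __global__ void producer(volatile int *data, volatile int *flag)
    {
        *data = 42;               // 1. write the payload
        __threadfence_system();   // 2. order the write before the flag, system-wide
        *flag = 1;                // 3. raise the "ready" flag
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);

        volatile int *hData, *hFlag;
        cudaHostAlloc((void **)&hData, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void **)&hFlag, sizeof(int), cudaHostAllocMapped);
        *hData = 0;
        *hFlag = 0;

        int *dData, *dFlag;
        cudaHostGetDevicePointer((void **)&dData, (void *)hData, 0);
        cudaHostGetDevicePointer((void **)&dFlag, (void *)hFlag, 0);

        producer<<<1, 1>>>(dData, dFlag);   // runs asynchronously

        while (*hFlag == 0) { }             // CPU passively polls the mapped flag
        printf("payload = %d\n", *hData);   // sees 42 only because of the fence

        cudaDeviceSynchronize();
        cudaFreeHost((void *)hData);
        cudaFreeHost((void *)hFlag);
        return 0;
    }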

If the CPU is the producer, the GPU would have to bypass the L2 cache because it is not coherent with CPU memory.
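
For the reverse direction the roles simply swap; a device-side sketch (host setup as in the previous example, with the CPU writing the payload before raising the flag): the kernel polls through a volatile pointer so each read is re-issued to memory rather than reusing a stale value, and, per the caveat above, it may additionally need an explicitly uncached load on hardware where mapped memory is routed through the GPU's L2.

    #include <cuda_runtime.h>

    // GPU consumer: wait for a flag the CPU raises, then read the payload.
    // 'volatile' forces every access back to memory instead of letting the
    // compiler reuse a previously loaded value.
    __global__ void consumer(volatile int *flag, volatile int *data, int *result)
    {
        while (*flag == 0) { }   // spin until the CPU publishes
        *result = *data;         // read the CPU-produced payload
    }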

If the CPU and GPU are trying to update the same memory location concurrently, there is no mechanism to ensure atomicity between the two. Doing CPU atomics will ensure that the update is atomic with respect to CPU code, and doing GPU atomics will ensure that the update is atomic with respect to the GPU that is doing the update.
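
In code, that unsafe pattern looks like this (an illustration, with a GCC/Clang builtin standing in for the host-side locked operation): each increment is atomic only on its own side.

    #include <cuda_runtime.h>

    // 'counter' lives in mapped pinned host memory; hCounter is the host
    // pointer and dCounter the device pointer to the same location.

    // GPU side: atomic only with respect to other threads on this GPU.
    __global__ void gpuIncrement(unsigned int *dCounter)
    {
        atomicAdd(dCounter, 1u);
    }

    // CPU side: atomic only with respect to other CPU threads
    // (__sync_fetch_and_add is a GCC/Clang builtin, shown for illustration).
    void cpuIncrement(unsigned int *hCounter)
    {
        __sync_fetch_and_add(hCounter, 1u);
    }

    // If gpuIncrement<<<...>>>(dCounter) runs while host threads call
    // cpuIncrement(hCounter), the two read-modify-write sequences are not
    // atomic with respect to each other: one side can overwrite an in-flight
    // update from the other, and the final count can come up short.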

All of the foregoing discussion assumes there is only one GPU; if multiple GPUs are involved, all bets are off. Although atomics are provided for in the PCI Express 3.0 bus specification, I don’t believe they are supported by NVIDIA GPUs. And support in the underlying platform also is not guaranteed.

It seems to me that whatever a developer may be trying to accomplish by doing atomics on mapped pinned memory, there’s probably a method that is faster, more likely to work, or both.

OTHER TIPS

Yes, this works atomically from a single GPU. So if no other CPU or GPU is accessing the memory it will be atomic. Atomics are implemented in the L2 cache and the CROP (on various GPUs), and both can handle system memory accesses.

It will be slow, though. This memory is not cached on the GPU.

When Nick says, "the facilities to enforce mutual exclusion for locked operations are not visible to peripherals on the PCI express bus", it makes me think he's referring to the lack of atomicity when accessing that memory from both processors, which is correct.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow