It turns out that #StoreLoad is exactly the right barrier for this situation. As explained simply by Jeff Preshing:
A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier.
In C++11, std::atomic_thread_fence(std::memory_order_seq_cst) apparently acts as a #StoreLoad barrier (as well as the other three: #StoreStore, #LoadLoad, and #LoadStore). See this C++11 draft paper.
Side note: On x86, the mfence instruction acts as a #StoreLoad barrier; it can be emitted explicitly with the _mm_mfence() compiler intrinsic if need be.
So a pattern for lock-free code might be:
Initialize:
CPU 1: setupStuff();
CPU 1: std::atomic_thread_fence(std::memory_order_seq_cst);
Run parallel stuff
Uninitialize:
CPU 2: std::atomic_thread_fence(std::memory_order_seq_cst);
CPU 2: teardownStuff();