The result from your first experiment is interesting ("And I will get r1==0 && r2 == 0 sometimes while run thread1 and thread2 concurrently ....even strong memory model like intel cpu , load disordered before store still happen"), but not only for the reasons you think. Atomics don't just stop the processor and cache subsystem from reordering memory accesses; they stop the compiler from doing so as well. GCC 4.8 at Coliru compiles this code to assembly in which the load is issued before the store:
    _Z7thread1v:
    .LFB326:
        .cfi_startproc
        movl    y(%rip), %eax
        movl    $1, x(%rip)
        movl    %eax, r1(%rip)
        ret
Even if the processor guaranteed memory ordering here, you need some kind of fencing to keep the compiler from screwing things up.
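For reference, here is a minimal sketch of what the first experiment presumably boils down to. Your source isn't reproduced in this answer, so the plain int globals and the names r2, thread2 and run_sequentially are assumptions (x, y, r1 and thread1 come from the assembly above), and the plain_demo namespace is only a hypothetical wrapper. Nothing in this code tells the compiler that another thread can observe x or y, which is why it is allowed to hoist the load above the store.

```cpp
#include <cassert>

// Hypothetical reconstruction of the first experiment: plain, non-atomic globals.
namespace plain_demo {

int x = 0, y = 0;    // shared flags (names taken from the assembly above)
int r1 = 0, r2 = 0;  // values observed by each thread body (r2 is assumed)

// The compiler only has to preserve single-threaded behavior here, so it may
// emit the load of y before the store to x.
void thread1() {
    x = 1;
    r1 = y;
}

void thread2() {
    y = 1;
    r2 = x;
}

// Runs both bodies back to back on ONE thread. Calling them from two real
// threads would be a data race (undefined behavior) on these plain ints.
void run_sequentially() {
    x = y = r1 = r2 = 0;
    thread1();
    thread2();
    // On a single thread any reordering is invisible: thread1 reads y before
    // thread2 writes it, and thread2 reads x after thread1 wrote it.
    assert(r1 == 0 && r2 == 1);
}

}  // namespace plain_demo
```

The single-threaded check always passes, reordered or not, which is exactly why the compiler considers the transformation legal.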
Your second program is invalid because it uses memory_order_acq_rel as the memory ordering for a store, which the standard does not permit (the behavior is undefined). acquire only makes sense for loads, and release only for stores, so among atomic operations memory_order_acq_rel is only valid for read-modify-write operations like exchange or fetch_add. Replacing memory_order_acq_rel with memory_order_release achieves the semantics you want, and the assembly produced is again interesting:
    _Z7thread1v:
    .LFB332:
        .cfi_startproc
        movl    $1, x(%rip)
        movl    y(%rip), %eax
        movl    %eax, r1(%rip)
        ret
The instructions are exactly what we would expect to be generated, with no special fence instructions. The x86 memory model is strong enough to provide acquire and release ordering with plain old mov instructions. In this instance, atomics are only necessary to tell the compiler to keep its fingers out of the code.
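A minimal sketch of the corrected second program, assuming the same shape as your experiment: the release store matches what is described above, while the acquire ordering on the load, the names r2, thread2 and run_once, and the acq_rel_demo namespace are assumptions made for illustration. The checks only assert facts that hold on every run; they make no claim about which (r1, r2) pair you will see.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical wrapper namespace; x, y, r1 and thread1 mirror the assembly.
namespace acq_rel_demo {

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;  // each written by one thread, read only after join()

void thread1() {
    x.store(1, std::memory_order_release);   // release is valid for a store
    r1 = y.load(std::memory_order_acquire);  // acquire is valid for a load
}

void thread2() {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

// Runs the two bodies concurrently once and checks the invariants.
void run_once() {
    x.store(0);
    y.store(0);
    r1 = r2 = 0;
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // After both joins the stores are visible, and each load saw 0 or 1.
    assert(x.load() == 1 && y.load() == 1);
    assert((r1 == 0 || r1 == 1) && (r2 == 0 || r2 == 1));
}

}  // namespace acq_rel_demo
```

Calling acq_rel_demo::run_once() in a loop from main and printing r1 and r2 is a simple way to see which combinations your machine produces. Build with thread support enabled (for example -pthread with GCC).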
Your third program is (technically) unpredictable despite generating the same assembly as the second:
    _Z7thread1v:
    .LFB332:
        .cfi_startproc
        movl    $1, x(%rip)
        movl    y(%rip), %eax
        movl    %eax, r1(%rip)
        ret
Although the results are the same this time, there's no guarantee that the compiler won't choose to reorder the instructions as it did for the first program. The result may change when you upgrade your compiler, or introduce other instructions, or for any other reason. If you start compiling on ARM, all bets are off ;) It's also interesting that despite relaxing the requirements in the source program, the generated assembly is the same. There's no way to relax the memory ordering below the guarantees that the processor architecture already enforces.
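For comparison, a sketch of what the relaxed variant presumably looks like, assuming memory_order_relaxed on every access. The names r2, thread2, reset and run_sequentially and the relaxed_demo namespace are hypothetical. Relaxed operations are still atomic, so there is no data race, but they impose no ordering between the accesses to x and y, so the compiler is free to swap them.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper namespace for the relaxed variant.
namespace relaxed_demo {

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    // x and y are different objects, so relaxed ordering lets the compiler
    // (or a weaker CPU) perform this load before the store.
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

void reset() {
    x.store(0, std::memory_order_relaxed);
    y.store(0, std::memory_order_relaxed);
    r1 = r2 = 0;
}

// Single-threaded sanity check: run on one thread, the bodies behave like
// ordinary sequential code no matter how the instructions were scheduled.
void run_sequentially() {
    reset();
    thread1();
    thread2();
    assert(r1 == 0 && r2 == 1);
}

}  // namespace relaxed_demo
```

Comparing the assembly your compiler emits for this thread1 with the release/acquire version is the quickest way to check whether it still happens to produce the same instructions.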