Memory model test in C++11: curious about memory_order_relaxed

Question 1

The result from your first experiment is interesting: "And I will get r1==0 && r2 == 0 sometimes while run thread1 and thread2 concurrently ....even strong memory model like intel cpu , load disordered before store still happen" but not only for the reasons you think. Atomics constrain reordering of memory accesses not only by the processor and cache subsystem, but by the compiler as well. GCC 4.8 on Coliru compiles this code to assembly in which the load instruction comes before the store:

_Z7thread1v:
.LFB326:
    .cfi_startproc
    movl    y(%rip), %eax
    movl    $1, x(%rip)
    movl    %eax, r1(%rip)
    ret

Even if the processor guaranteed memory ordering here, you need some kind of fencing to keep the compiler from screwing things up.
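
For reference, the first experiment presumably looked something like the store-buffering test below. This is only a sketch, not the asker's actual code: the names x, y, r1 and thread1 come from the assembly above, while thread2 and count_both_zero are hypothetical helpers added for illustration.

```cpp
#include <atomic>
#include <thread>

// Assumed reconstruction of the first experiment (relaxed ordering everywhere).
std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);  // may be moved above the store
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);  // may be moved above the store
}

// Hypothetical helper: runs the test repeatedly and counts how often
// both loads observed 0.
int count_both_zero(int iterations) {
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        x.store(0);
        y.store(0);
        std::thread a(thread1), b(thread2);
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) ++hits;
    }
    return hits;
}
```

Calling count_both_zero from main with a large iteration count shows how often the outcome occurs on a given compiler and machine. The count can legitimately be zero; the memory model only says the outcome is allowed, not that it will happen.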

Your second program has undefined behavior because it uses memory_order_acq_rel as the memory ordering for a store. Acquire only makes sense for loads, and release only for stores, so memory_order_acq_rel is only valid as an ordering for atomic read-modify-write operations like exchange or fetch_add. Replacing memory_order_acq_rel with memory_order_release achieves the semantics you want, and the assembly produced is again interesting:

_Z7thread1v:
.LFB332:
    .cfi_startproc
    movl    $1, x(%rip)
    movl    y(%rip), %eax
    movl    %eax, r1(%rip)
    ret

The instructions are exactly what we would expect to be generated, with no special fence instructions. The x86 memory model is strong enough to provide release ordering with plain old mov instructions. In this instance, atomics are only necessary to tell the compiler to keep its fingers out of the code.
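
To make the rule about orderings concrete, here is a small sketch of which ordering goes with which operation. The function names are hypothetical, and the acquire load is an assumption, since the original second program is not reproduced here.

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1 = 0;

// A store may use relaxed, release or seq_cst; a load may use relaxed,
// consume, acquire or seq_cst.
void thread1_release_acquire() {
    x.store(1, std::memory_order_release);
    r1 = y.load(std::memory_order_acquire);  // assumed ordering for the load
}

// memory_order_acq_rel is meant for read-modify-write operations such as
// exchange or fetch_add, which both read and write the variable.
int swap_in(int value) {
    return x.exchange(value, std::memory_order_acq_rel);
}
```

Passing memory_order_acq_rel to store() typically still compiles, which is why the mistake is easy to miss.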

Your third program is (technically) unpredictable despite generating the same assembly as the second:

_Z7thread1v:
.LFB332:
    .cfi_startproc
    movl    $1, x(%rip)
    movl    y(%rip), %eax
    movl    %eax, r1(%rip)
    ret

Although the results are the same this time, there's no guarantee that the compiler won't choose to reorder the instructions as it did for the first program. The result may change when you upgrade your compiler, or introduce other instructions, or for any other reason. If you start compiling on ARM, all bets are off ;) It's also interesting that despite relaxing the requirements in the source program, the generated assembly is the same. You can't relax the memory ordering below the guarantees that the processor architecture itself enforces.

Question 2

There are a bunch of issues here:

(1) Releases and acquires must be in pairs. Otherwise, they don't establish synchronization and don't guarantee anything.

(2) Even if you make the stores release and the loads acquire in your example, the memory model still allows r1=r2=0. You need to make everything seq_cst to forbid that execution.

(3) We've built a tool at http://demsky.eecs.uci.edu/c11modelchecker.html for testing C11 atomic code. It will give you all executions allowed under reasonable interpretations of the C/C++11 memory model.
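
Point (2) can be sketched as follows. This assumes the same x, y, r1, r2 layout as the question's example; the function names are hypothetical. With every operation seq_cst (the default when no ordering argument is given), the r1=r2=0 execution is forbidden.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1_seq_cst() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2_seq_cst() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

// Hypothetical helper: counts runs in which both loads observed 0.
// All four operations take part in one total order, so whichever store
// comes first in that order is visible to the other thread's load.
int count_both_zero_seq_cst(int iterations) {
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        x.store(0);
        y.store(0);
        std::thread a(thread1_seq_cst), b(thread2_seq_cst);
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) ++hits;
    }
    return hits;
}
```

On x86 the compiler has to emit something stronger than a plain mov for the seq_cst store (typically an xchg or a mov followed by mfence), which is the price of ruling out that execution.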

You may not see these interesting behaviors on current GCC versions yet, as at least the earlier versions ignored the memory ordering parameter and always used seq_cst. If GCC changes that, you could see r1=r2=0.