Question

As Anthony Williams said:

> some_atomic.load(std::memory_order_acquire) does just drop through to a simple load instruction, and some_atomic.store(std::memory_order_release) drops through to a simple store instruction.

It is known that on x86, load() and store() operations with the memory orders memory_order_consume, memory_order_acquire, memory_order_release, and memory_order_acq_rel do not require any extra fence instructions.
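As a rough sketch (my own illustration, not from the original post), this is the kind of code the quote describes; on x86 a typical compiler is expected to lower both functions to plain mov instructions with no fence, which you can confirm by inspecting the generated assembly:

    #include <atomic>

    std::atomic<int> some_atomic{0};

    // On x86 the acquire load and the release store should each compile
    // to a single mov instruction, with no additional fence.
    int read_it() {
        return some_atomic.load(std::memory_order_acquire);
    }

    void write_it(int v) {
        some_atomic.store(v, std::memory_order_release);
    }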

But on ARMv8 we know that there are memory barriers for both load() and store(): http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2

For a comparison across different CPU architectures, see: http://g.oswego.edu/dl/jmm/cookbook.html

However, for a CAS operation on x86, these two lines with different memory orders compile to identical disassembly (MSVS2012, x86_64):

    a.compare_exchange_weak(temp, 4, std::memory_order_seq_cst, std::memory_order_seq_cst);
    000000013FE71A2D  mov         ebx,dword ptr [temp]
    000000013FE71A31  mov         eax,ebx
    000000013FE71A33  mov         ecx,4
    000000013FE71A38  lock cmpxchg dword ptr [temp],ecx

    a.compare_exchange_weak(temp, 5, std::memory_order_relaxed, std::memory_order_relaxed);
    000000013FE71A4D  mov         ecx,5
    000000013FE71A52  mov         eax,ebx
    000000013FE71A54  lock cmpxchg dword ptr [temp],ecx

Disassembly of the same code compiled by GCC 4.8.1 for x86_64, as shown in GDB:

    a.compare_exchange_weak(temp, 4, std::memory_order_seq_cst, std::memory_order_seq_cst);
    a.compare_exchange_weak(temp, 5, std::memory_order_relaxed, std::memory_order_relaxed);

    0x4613b7  <+0x0027>         mov    0x2c(%rsp),%eax
    0x4613bb  <+0x002b>         mov    $0x4,%edx
    0x4613c0  <+0x0030>         lock cmpxchg %edx,0x20(%rsp)
    0x4613c6  <+0x0036>         mov    %eax,0x2c(%rsp)
    0x4613ca  <+0x003a>         lock cmpxchg %edx,0x20(%rsp)

On x86/x86_64 platforms, does any atomic CAS operation, for example atomic_val.compare_exchange_weak(temp, 1, std::memory_order_relaxed, std::memory_order_relaxed);, always execute with std::memory_order_seq_cst ordering?

And if any CAS operation on x86 always runs with sequential consistency (std::memory_order_seq_cst) regardless of the requested ordering, is the same true on ARMv8?

QUESTION: Does a CAS with std::memory_order_relaxed still lock the memory bus on x86 or ARM?

ANSWER: On x86, any compare_exchange_weak() operation with any std::memory_order (even std::memory_order_relaxed) always translates to LOCK CMPXCHG, which locks the bus (or cache line) so that it is really atomic, and it is just as expensive as XCHG ("the cmpxchg is just as expensive as the xchg instruction").

(An addition: XCHG with a memory operand is implicitly locked, i.e. equivalent to LOCK XCHG, but CMPXCHG is not equivalent to LOCK CMPXCHG, which is the truly atomic form.)

On ARM and PowerPC, compare_exchange_weak() translates to different processor instruction sequences for different std::memory_orders, implemented via LL/SC (load-linked/store-conditional) instructions.
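As a hedged illustration (the function and variable names are mine, not from the original), one can compile a snippet like this for AArch64 with optimizations and compare the generated LL/SC sequences; the seq_cst version is expected to use acquire/release exclusives (ldaxr/stlxr), while the relaxed version can use plain exclusives (ldxr/stxr) with no barriers:

    #include <atomic>

    std::atomic<int> a{0};

    // seq_cst CAS: expected to use ldaxr/stlxr on AArch64.
    bool cas_seq_cst(int& expected) {
        return a.compare_exchange_weak(expected, 4,
                                       std::memory_order_seq_cst,
                                       std::memory_order_seq_cst);
    }

    // relaxed CAS: expected to use plain ldxr/stxr, with no barriers.
    bool cas_relaxed(int& expected) {
        return a.compare_exchange_weak(expected, 5,
                                       std::memory_order_relaxed,
                                       std::memory_order_relaxed);
    }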

Processor memory-barrier instructions for x86 (except CAS), ARM, and PowerPC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html


Solution

You shouldn't worry about which instructions the compiler maps a given C++11 construct to, as this doesn't capture everything. Instead you need to develop code with respect to the guarantees of the C++11 memory model. As the comments note, your compiler, or a future compiler, is free to reorder relaxed memory operations as long as it doesn't violate the memory model. It is also worthwhile to run your code through a tool like CDSChecker to see which behaviors are allowed under the memory model.
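For example, here is a minimal message-passing sketch (my own illustration, assuming one producer thread and one consumer thread) where the reasoning rests entirely on the release/acquire guarantees of the memory model, not on any particular instruction mapping:

    #include <atomic>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;                                    // ordinary store
        ready.store(true, std::memory_order_release); // publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { } // wait for publish
        // Guaranteed by the memory model: data == 42 here, on any
        // architecture, whatever instructions the compiler emits.
    }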

OTHER TIPS

x86 guarantees that loads following loads are ordered, and stores following stores are ordered. Given that CAS requires both loading and storing, all operations have to be ordered around it.

However, it is worth noting that, in the presence of multiple atomics with memory_order_relaxed, the compiler is allowed to reorder them. It cannot do so with memory_order_seq_cst.
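A small illustrative sketch of that point (the names are hypothetical): with relaxed ordering the compiler may legally make the two stores visible to another thread in either order, while with seq_cst they must appear in program order:

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    void relaxed_stores() {
        x.store(1, std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);  // may become visible before x
    }

    void seq_cst_stores() {
        x.store(1, std::memory_order_seq_cst);
        y.store(1, std::memory_order_seq_cst);  // always visible after x
    }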

I think the compiler emits lock cmpxchg even for memory_order_relaxed because that's the only way to make sure the compare-and-exchange itself is actually atomic. As artless_noise said in the comments, other architectures can use Load-Linked/Store-Conditional (LL/SC) instructions to implement compare_exchange_weak(...).
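That LL/SC implementation is also why the weak form may fail spuriously (e.g. the store-conditional is interrupted), so compare_exchange_weak is meant to be used in a retry loop; a minimal sketch (my own, not from the answer):

    #include <atomic>

    std::atomic<int> counter{0};

    // Typical retry loop: on LL/SC architectures the store-conditional
    // can fail even when the value matched, so loop until it succeeds.
    void increment() {
        int old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed)) {
            // On failure, 'old' is reloaded with the current value; retry.
        }
    }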

memory_order_relaxed should still let the compiler hoist stores of other variables out of loops, and otherwise reorder memory access at compile time.
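For instance (a hypothetical sketch), a compiler is permitted to coalesce these relaxed stores and publish only the final value after the loop, although whether a given compiler actually does so is a quality-of-implementation matter:

    #include <atomic>

    std::atomic<long> progress{0};

    void work(long n) {
        for (long i = 0; i < n; ++i) {
            // ... some computation ...
            progress.store(i, std::memory_order_relaxed); // may be coalesced
        }
    }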

If there were a way to do an atomic CAS on x86 that wasn't also a full memory barrier, a good compiler would use it for memory_order_relaxed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow