Compare and swap in machine code in C

https://stackoverflow.com/questions/4213639

26-09-2019
|

Question

How would you write a function in C which does an atomic compare and swap on an integer value, using embedded machine code (assuming, say, x86 architecture)? Can it be any more specific if its written only for the i7 processor?

Does the translation act as a memory fence, or does it just ensure ordering relation just on that memory location included in the compare and swap? How costly is it compared to a memory fence?

Thank you.

Solution

The easiest way to do it is probably with a compiler intrinsic like _InterlockedCompareExchange(). It looks like a function but is actually a special case in the compiler that boils down to a single machine op. In the case of the MSVC x86 intrinsic, that works as a read/write fence as well, but that's not necessarily true on other platforms. (For example, on the PowerPC, you'd need to explicitly issue a lwsync to fence memory reordering.)

In general, on many common systems, a compare-and-swap operation usually only enforces an atomic transaction upon the one address it's touching. Other memory access can be reordered, and in multicore systems, memory addresses other than the one you've swapped may not be coherent between the cores.

OTHER TIPS

You can use the CMPXCHG instruction with the LOCK prefix for atomic execution.

E.g.

lock cmpxchg DWORD PTR [ebx], edx

lock cmpxchgl %edx, (%ebx)

This compares the value in the EAX register with the value at the address stored in the EBX register and stores the value in the EDX register to that location if they are the same, otherwise it loads the value at the address stored in the EBX register into EAX.

You need to have a 486 or later for this instruction to be available.

If your integer value is 64 bit than use cmpxchg8b 8 byte compare and exchange under IA32 x86. Variable must be 8 byte aligned.

Example:
      mov   eax, OldDataA           //load Old first 32 bits
      mov   edx, OldDataB           //load Old second 32 bits
      mov   ebx, NewDataA           //load first 32 bits
      mov   ecx, NewDataB           //load second 32 bits
      mov   edi, Destination        //load destination pointer
      lock cmpxchg8b qword ptr [edi]
      setz  al                      //if transfer is succesful the al is 1 else 0

If the LOCK prefix is omitted in atomic processor instructions, atomic operation across multiprocessor environment will not be guaranteed.

In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted. Intel Instruction Set Reference

Without LOCK prefix the operation will guarantee not being interrupted by any event (interrupt) on current processor/core only.

It's interesting to note that some processors don't provide a compare-exchange, but instead provide some other instructions ("Load Linked" and "Conditional Store") that can be used to synthesize the unfortunately-named compare-and-swap (the name sounds like it should be similar to "compare-exchange" but should really be called "compare-and-store" since it does the comparison, stores if the value matches, and indicates whether the value matched and the store was performed). The instructions cannot synthesize compare-exchange semantics (which provides the value that was read in case the compare failed), but may in some cases avoid the ABA problem which is present with Compare-Exchange. Many algorithms are described in terms of "CAS" operations because they can be used on both styles of CPU.

A "Load Linked" instruction tells the processor to read a memory location and watch in some way to see if it might be written. A "Conditional Store" instruction instructs the processor to write a memory location only if nothing can have written it since the last "Load Linked" operation. Note that the determination may be pessimistic; processing an interrupt, for example, may invalidate a "Load-Linked"/"Conditional Store" sequence. Likewise in a multi-processor system, an LL/CS sequence may be invalidated by another CPU accessing to a location on the same cache line as the location being watched, even if the actual location being watched wasn't touched. In typical usage, LL/CS are used very close together, with a retry loop, so that erroneous invalidations may slow things down a little but won't cause much trouble.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow