Question

I am currently looking at the assembly gcc generates for atomic operations. I tried the following short sequence:

int x1;
int x2;

int foo;

void test()
{
  __atomic_store_n( &x1, 1, __ATOMIC_SEQ_CST );
  if( __atomic_load_n( &x2  ,__ATOMIC_SEQ_CST ))
    return;

  foo = 4;
}

In the code-generation part of Herb Sutter's atomic weapons talk, he mentions that the x86 manual mandates using xchg for atomic stores and a plain mov for atomic loads. So I was expecting something along the lines of:

test():
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $1, %eax
    xchg    %eax, x1(%rip)
    movl    x2(%rip), %eax
    testl   %eax, %eax
    setne   %al
    testb   %al, %al
    je      .L2
    jmp     .L1
.L2:
    movl    $4, foo(%rip)
.L1:
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

Here the memory fence is implicit, because the xchg instruction carries implied lock semantics.

However, if I compile this with gcc -march=core2 -S test.cc, I get the following:

test():
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $1, %eax
    movl    %eax, x1(%rip)
    mfence
    movl    x2(%rip), %eax
    testl   %eax, %eax
    setne   %al
    testb   %al, %al
    je      .L2
    jmp     .L1
.L2:
    movl    $4, foo(%rip)
.L1:
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

So instead of an xchg operation, gcc here uses a mov + mfence combination. What is the reason for this code generation, which differs from what, according to Herb Sutter, the x86 architecture mandates?


Solution

The xchg instruction has implied lock semantics when the destination is a memory location: it swaps the contents of a register with the contents of a memory location atomically.
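For comparison, an actual atomic swap written with the same builtins does map onto xchg. A minimal sketch (the function and variable names are only illustrative, and the exact assembly depends on the compiler version and target):

int lock_word;

int test_and_set()
{
  /* Atomically writes 1 into lock_word and returns the previous value.
     On x86 this typically compiles to a single xchg, whose implied lock
     semantics also make it a full memory barrier. */
  return __atomic_exchange_n( &lock_word, 1, __ATOMIC_SEQ_CST );
}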

The example in the question performs an atomic store, not a swap. The x86 memory model guarantees that in a multi-processor/multi-core system, stores done by one thread are seen in that order by other threads, so a plain memory move is sufficient. That said, some older Intel CPUs and some clones have bugs in this area, and on those CPUs an xchg is required as a workaround. See the Significant optimizations section of this Wikipedia article on spinlocks:

http://en.wikipedia.org/wiki/Spinlock#Example_implementation

It states:

The simple implementation above works on all CPUs using the x86 architecture. However, a number of performance optimizations are possible:

On later implementations of the x86 architecture, spin_unlock can safely use an unlocked MOV instead of the slower locked XCHG. This is due to subtle memory ordering rules which support this, even though MOV is not a full memory barrier. However, some processors (some Cyrix processors, some revisions of the Intel Pentium Pro (due to bugs), and earlier Pentium and i486 SMP systems) will do the wrong thing and data protected by the lock could be corrupted. On most non-x86 architectures, explicit memory barrier or atomic instructions (as in the example) must be used. On some systems, such as IA-64, there are special "unlock" instructions which provide the needed memory ordering.
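To make that optimization concrete, here is a minimal spinlock sketch using the GCC builtins (the names spin_lock, spin_unlock and lock_word are only illustrative). On x86 the acquire side becomes an xchg loop, while the release side is a plain mov:

int lock_word;   /* 0 = unlocked, 1 = locked */

void spin_lock()
{
  /* Atomic swap: on x86 this is an (implicitly locked) xchg.
     Spin until the previous value was 0, i.e. until we took the lock. */
  while( __atomic_exchange_n( &lock_word, 1, __ATOMIC_ACQUIRE ))
    ;
}

void spin_unlock()
{
  /* Release store: a plain mov suffices on x86, because stores are not
     reordered with earlier loads or stores. */
  __atomic_store_n( &lock_word, 0, __ATOMIC_RELEASE );
}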

The memory barrier, mfence, ensures that all stores have completed (the store buffers in the CPU core are empty and the values have reached the cache or memory); it also ensures that no later loads execute out of order.
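That is also why the fence shows up only for the sequentially consistent store: a release store alone needs no fence on x86. A sketch (the exact output varies by compiler version):

int x1;   /* as in the question's example */

void store_release()
{
  /* On x86 a release store is just:  movl $1, x1(%rip)  -- no fence. */
  __atomic_store_n( &x1, 1, __ATOMIC_RELEASE );
}

void store_seq_cst()
{
  /* A sequentially consistent store must additionally keep later loads
     from being reordered before it, so the compiler emits either
     mov + mfence (as in the question) or a single xchg. */
  __atomic_store_n( &x1, 1, __ATOMIC_SEQ_CST );
}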

The fact that a MOV is sufficient to unlock the mutex (no serialization or memory barrier required) was "officially" clarified in a reply to Linus Torvalds by an Intel architect back in 1999:

http://lkml.org/lkml/1999/11/24/90.

I guess it was later discovered that this didn't work on some older x86 processors.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow