Question

I am currently looking at the assembly gcc generates for atomic operations. I tried the following short sequence:

int x1;
int x2;

int foo;

void test()
{
  __atomic_store_n( &x1, 1, __ATOMIC_SEQ_CST );
  if( __atomic_load_n( &x2  ,__ATOMIC_SEQ_CST ))
    return;

  foo = 4;
}

In the code-generation part of Herb Sutter's atomic weapons talk, he mentions that the x86 manual mandates using xchg for atomic stores and a plain mov for atomic loads. So I was expecting something along the lines of:

test():
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $1, %eax
    xchg    %eax, x1(%rip)
    movl    x2(%rip), %eax
    testl   %eax, %eax
    setne   %al
    testb   %al, %al
    je      .L2
    jmp     .L1
.L2:
    movl    $4, foo(%rip)
.L1:
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

Here the memory fence is implicit, because the xchg instruction carries implied lock semantics.

However, if I compile this with gcc -march=core2 -S test.cc, I get the following:

test():
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $1, %eax
    movl    %eax, x1(%rip)
    mfence
    movl    x2(%rip), %eax
    testl   %eax, %eax
    setne   %al
    testb   %al, %al
    je      .L2
    jmp     .L1
.L2:
    movl    $4, foo(%rip)
.L1:
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

So instead of an xchg operation, gcc here uses a mov + mfence combination. What is the reason for this code generation, which differs from what, according to Herb Sutter, the x86 architecture mandates?


Solution

The xchg instruction has implied lock semantics when the destination is a memory location: it swaps the contents of a register with the contents of a memory location atomically.
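For comparison, an actual atomic swap written with the same builtins does map onto xchg. A minimal sketch (the function and variable names are only illustrative, and the exact assembly depends on the compiler version and target):

int lock_word;

int test_and_set()
{
  /* Atomically writes 1 into lock_word and returns the previous value.
     On x86 this typically compiles to a single xchg, whose implied lock
     semantics also make it a full memory barrier. */
  return __atomic_exchange_n( &lock_word, 1, __ATOMIC_SEQ_CST );
}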

The example in the question performs an atomic store, not a swap. The x86 memory model guarantees that in a multi-processor/multi-core system, stores done by one thread are seen in that order by other threads, so a plain memory move is sufficient. That said, some older Intel CPUs and some clones have bugs in this area, and on those CPUs an xchg is required as a workaround. See the Significant optimizations section of this Wikipedia article on spinlocks:

http://en.wikipedia.org/wiki/Spinlock#Example_implementation

It states:

The simple implementation above works on all CPUs using the x86 architecture. However, a number of performance optimizations are possible:

On later implementations of the x86 architecture, spin_unlock can safely use an unlocked MOV instead of the slower locked XCHG. This is due to subtle memory ordering rules which support this, even though MOV is not a full memory barrier. However, some processors (some Cyrix processors, some revisions of the Intel Pentium Pro (due to bugs), and earlier Pentium and i486 SMP systems) will do the wrong thing and data protected by the lock could be corrupted. On most non-x86 architectures, explicit memory barrier or atomic instructions (as in the example) must be used. On some systems, such as IA-64, there are special "unlock" instructions which provide the needed memory ordering.
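To make that optimization concrete, here is a minimal spinlock sketch using the GCC builtins (the names spin_lock, spin_unlock and lock_word are only illustrative). On x86 the acquire side becomes an xchg loop, while the release side is a plain mov:

int lock_word;   /* 0 = unlocked, 1 = locked */

void spin_lock()
{
  /* Atomic swap: on x86 this is an (implicitly locked) xchg.
     Spin until the previous value was 0, i.e. until we took the lock. */
  while( __atomic_exchange_n( &lock_word, 1, __ATOMIC_ACQUIRE ))
    ;
}

void spin_unlock()
{
  /* Release store: a plain mov suffices on x86, because stores are not
     reordered with earlier loads or stores. */
  __atomic_store_n( &lock_word, 0, __ATOMIC_RELEASE );
}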

The memory barrier, mfence, ensures that all stores have completed (the store buffers in the CPU core are empty and the values have reached the cache or memory); it also ensures that no later loads execute out of order.
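That is also why the fence shows up only for the sequentially consistent store: a release store alone needs no fence on x86. A sketch (the exact output varies by compiler version):

int x1;   /* as in the question's example */

void store_release()
{
  /* On x86 a release store is just:  movl $1, x1(%rip)  -- no fence. */
  __atomic_store_n( &x1, 1, __ATOMIC_RELEASE );
}

void store_seq_cst()
{
  /* A sequentially consistent store must additionally keep later loads
     from being reordered before it, so the compiler emits either
     mov + mfence (as in the question) or a single xchg. */
  __atomic_store_n( &x1, 1, __ATOMIC_SEQ_CST );
}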

The fact that a MOV is sufficient to unlock the mutex (no serialization or memory barrier required) was "officially" clarified in a reply to Linus Torvalds by an Intel architect back in 1999:

http://lkml.org/lkml/1999/11/24/90.

I guess it was later discovered that this didn't work on some older x86 processors.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow