ARM64: LDXR/STXR vs LDAXR/STLXR

Question 1

OSAtomicAdd32Barrier() exists for people that are using OSAtomicAdd() for something beyond just atomic increment. Specifically, they are implementing their own multi-processing synchronization primitives based on OSAtomicAdd(). For example, creating their own mutex library. OSAtomicAdd32Barrier() uses heavy barrier instructions to enforce memory ordering on both side of the atomic operation. This is not desirable in normal usage.

To summarize:

1) If you just want to increment an integer in a thread-safe way, use OSAtomicAdd32()

2) If you are stuck with a bunch of old code that foolishly assumes OSAtomicAdd32() can be used as an interprocessor memory ordering and speculation barrier, replace it with OSAtomicAdd32Barrier()

Question 2

Oh, the mind-bending horror of weak memory ordering...

The first snippet is your basic atomic read-modify-write - if someone else touches whatever address x1 points to, the store-exclusive will fail and it will try again until it succeeds. So far so good. However, this only applies to the address (or more rightly region) covered by the exclusive monitor, so whilst it's good for atomicity, it's ineffective for synchronisation of anything other than that value.

Consider a case where CPU1 is waiting for CPU0 to write some data to a buffer. CPU1 sits there waiting on some kind of synchronisation object (let's say a semaphore), waiting for CPU0 to update it to signal that new data is ready.

CPU0 writes to the data address.
CPU0 increments the semaphore (atomically, as you do) which happens to be elsewhere in memory.
???
CPU1 sees the new semaphore value.
CPU1 reads some data, which may or may not be the old data, the new data, or some mix of the two.

Now, what happened at step 3? Maybe it all occurred in order. Quite possibly, the hardware decided that since there was no address dependency it would let the store to the semaphore go ahead of the store to the data address. Maybe the semaphore store hit in the cache whereas the data didn't. Maybe it just did so because of complicated reasons only those hardware guys understand. Either way it's perfectly possible for CPU1 to see the semaphore update before the new data has hit memory, thus read back invalid data.

To fix this, CPU0 must have a barrier between steps 1 and 2, to ensure the data has definitely been written before the semaphore is written. Having the atomic write be a barrier is a nice simple way to do this. However since barriers are pretty performance-degrading you want the lightweight no-barrier version as well for situations where you don't need this kind of full synchronisation.

Now, the even less intuitive part is that CPU1 could also reorder its loads. Again since there is no address dependency, it would be free to speculate the data load before the semaphore load irrespective of CPU0's barrier. Thus CPU1 also needs its own barrier between steps 4 and 5.

For the more authoritative, but pretty heavy going, version have a read of ARM's Barrier Litmus Tests and Cookbook. Be warned, this stuff can be confusing ;)

As an aside, in this case the architectural semantics of acquire/release complicate things further. Since they are only one-way barriers, whilst OSAtomicAdd32Barrier adds up to a full barrier relative to code before and after it, it doesn't actually guarantee any ordering relative to the atomic operation itself - see this discussion from Linux for more explanation. Of course, that's from the theoretical point of view of the architecture; in reality it's not inconceivable that the A7 hardware has taken the 'simple' option of wiring up LDAXR to just do DMB+LDXR, and so on, meaning they can get away with this since they're at liberty to code to their own implementation, rather than the specification.

Question 3

I would guess that this is simply a way of reproducing existing architecture-independent semantics for this operation.

With the ldaxr/stlxr pair, the above sequence will assure correct ordering if the AtomicAdd32 is used as a synchronization mechanism (mutex/semaphore) - regardless of whether the resulting higher-level operation is an acquire or release.

So - this is not about enforcing consistency of the atomic add, but about enforcing ordering between acquiring/releasing a mutex and any operations performed on the resource protected by that mutex.

It is less efficient than the ldxar/stxr or ldxr/stlxr you would use in a normal native synchronization mechanism, but if you have existing platform-independent code expecting an atomic add with those semantics, this is probably the best way to implement it.