How are atomic operations implemented at a hardware level?

Question 1

Here is an article over at software.intel.com on that sheds little light on user level locks:

User level locks involve utilizing the atomic instructions of processor to atomically update a memory space. The atomic instructions involve utilizing a lock prefix on the instruction and having the destination operand assigned to a memory address. The following instructions can run atomically with a lock prefix on current Intel processors: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. [...] On most instructions a lock prefix must be explicitly used except for the xchg instruction where the lock prefix is implied if the instruction involves a memory address.

In the days of Intel 486 processors, the lock prefix used to assert a lock on the bus along with a large hit in performance. Starting with the Intel Pentium Pro architecture, the bus lock is transformed into a cache lock. A lock will still be asserted on the bus in the most modern architectures if the lock resides in uncacheable memory or if the lock extends beyond a cache line boundary splitting cache lines. Both of these scenarios are unlikely, so most lock prefixes will be transformed into a cache lock which is much less expensive.

So what prevents another core from accessing the memory address? The cache coherency protocol already manages access rights for cache lines. So if a core has (temporal) exclusive access rights to a cache line, no other core can access that cache line. To access that cache line the other core has to obtain access rights first, and the protocol to obtain those rights involves the current owner. In effect, the cache coherency protocol prevents other cores from accessing the cache line silently.

If the locked access is not bound to a single cache line things get more complicated. There are all kinds of nasty corner cases, like locked accesses over page boundaries, etc. Intel does not tell details and they probably use all kinds of tricks to make locks faster.

Question 2

An example implementation of this is LL/SC where a processor will actually have extra instructions that are used to complete atomic operations. On the memory side of it is cache coherency. One of the most popular cache coherency protocols is the MESI Protocol. .

Question 3

Cache coherency protocol by itself is not sufficient to implement atomic operations. Lets say you want to implement an atomic increment. Below are the steps involved

Load the value from the cache into a register
Increment the value loaded into the register
Store the updated value back to the cache

So in order to implement the above 3 instructions in an atomic fashion, we should first get exclusive access to the cacheline which contains the required value. Once we get exclusive access, we should not relinquish exclusive access on this cacheline until the "store" operation is completed. This means the CPU executing the atomic instructions should not respond to any cache coherency protocol messages for this cacheline in the mean time. While the devil is in the details of how this is implemented, at-least it gives us a mental model

Below is what linus torvalds mentioned about atomic instructions

Atomic instructions bypass the store buffer or at least they act as if they do - they likely actually use the store buffer, but they flush it and the instruction pipeline before the load and wait for it to drain after, and have a lock on the cacheline that they take as part o the load, and release as part of the store - all to make sure that the cacheline doesn't go away in between and that nobody else can see the store buffer contents while this is going on.

Question 4

The memory controller is only in charge of making sure that memory & cache on different processors stays consistent - if you write to memory on CPU1, CPU2 won't be able to read something else from its cache. It's not its responsibility to make sure that they're both trying to manipulate the same data. There are a few low level instructions used locking and atomic operations. These are used at the OS level to manipulate small chunks of memory to create things like mutexes and semaphores, these are literally one or two bytes of memory that need to have atomic, synchronized operations performed on them. Applications then build on top of this to perform operations on larger data structures and resources.