Question

After seeing Herb Sutter's excellent talk about "atomic weapons" I got a bit confused about the Relaxed Atomics examples.

My takeaway was that an atomic in the C++ Memory Model (SC-DRF = Sequentially Consistent for Data Race Free) does an "acquire" on a load/read.

I understand that for a load [and a store] the default is std::memory_order_seq_cst and therefore the two are the same:

myatomic.load();                          // (1)
myatomic.load(std::memory_order_seq_cst); // (2)

So far so good, no Relaxed Atomics involved (and after hearing the talk I will never use the relaxed ones. Ever. Promise. But when someone asks me, I might have to explain...).

But why is it the "relaxed" semantics when I use

myatomic.load(std::memory_order_acquire);   // (3)

Since load is acquiring and not releasing, why is this different from (1) and (2)? What actually is relaxed here?

The only thing I can think of is that I misunderstood that load means acquire. And if that is true, and the default seq_cst means both, doesn't that mean a full fence -- nothing can pass up that instruction, nor down? I have to have misunderstood that part.

[and symmetrically for store and release].


Solution

It can be a bit confusing to call myatomic.load(std::memory_order_acquire); a "relaxed atomic" load, since there is also a distinct std::memory_order_relaxed. Some people describe any ordering weaker than seq_cst as "relaxed".

You're right to note that a sequentially-consistent load is an acquire load, but it has an additional requirement: a sequentially-consistent load is also part of the single total order of all seq_cst operations.

It comes into play when you're dealing with more than one atomic variable: individual modification orders of two atomics may appear in different relative order to different threads, unless sequential consistency is imposed.

OTHER TIPS

If you "relax" some ordering requirements of seq_cst, there's mo_acq_rel (and pure acquire and pure release).

Even more relaxed than that is mo_relaxed; no ordering wrt. anything else, just atomicity1.

When compiling for most ISAs, a seq_cst load can use the same asm as an acquire load; the convention is to make stores expensive, not loads. The C/C++11 mappings to processors for ISAs including x86, POWER, ARMv7 and ARMv8 include two alternative strategies for some ISAs. Compilers for the same platform have to pick the same strategy to be compatible with each other; otherwise a seq_cst store in one function could reorder with a seq_cst load in another function.

On a typical CPU where the memory model includes a store buffer and coherent cache, if you store and then reload in the same thread, seq_cst requires that the reload not happen until after the store is globally visible to all threads. This means either a full barrier (including StoreLoad) after seq_cst stores, or one before seq_cst loads. Since cheap loads are more valuable than cheap stores, the usual mapping keeps loads plain and makes stores expensive, e.g. x86 mov + mfence for seq_cst stores. (The same applies to reloading any other location: that load can't happen until the store commits. That's what Jeff Preshing's Memory Reordering Caught in the Act is about.)

This is a practical example of creating a global total order of operations on different variables that all threads can agree on. (x86 asm provides acquire for pure loads and release for pure stores "for free", but seq_cst only via lock-prefixed atomic RMW instructions or explicit barriers. So Preshing's x86 asm example corresponds to C++11 mo_release stores, not mo_seq_cst.)


ARMv8 / AArch64 is interesting: it has STLR (sequential-release store) and LDAR (acquire load). Instead of stalling all later loads until the store buffer drains and commits an STLR to L1d cache (global visibility), an implementation can be more efficient.

Draining the store buffer only has to happen before an LDAR executes; other loads can execute, and even later stores can commit to L1d. (A sequential-release store is still at minimum a one-way barrier.) To be this efficient / weak, LDAR has to probe the store buffer to check for not-yet-committed STLR stores. But if you can do that, mo_seq_cst stores can be significantly cheaper than on x86, as long as you don't do a seq_cst load of anything else right away afterwards.

On most other ISAs, the only way to recover sequential consistency is a full barrier instruction (after the store). That blocks all later loads and stores until all previous stores commit to L1d cache, which is stronger than ISO C++ seq_cst actually implies or requires; AArch64 just happens to be the one ISA with instructions that are as strong as ISO C++ requires but no stronger.

(Compiling for many other weakly-ordered ISAs needs to promote acq / release to significantly stronger than needed, e.g. ARMv7 needs a full barrier for release stores.)


Footnote 1: like what you get in old pre-C++11 code that rolled its own atomics with volatile, without any barriers.

"And if that is true, and the default seq_cst means both, doesn't that mean a full fence"

It absolutely does not mean both, nor a "full fence", whatever that is.

seq_cst implies

  • acquire only on load operations
  • and release only on store operations.

So it implies both only on the operations that combine both: the RMW atomic operations.

Sequential consistency also means that these operations are globally ordered, that is: all operations marked seq_cst in the whole program run in a single total order, one that is compatible with the sequencing of operations in each thread. It says nothing about how other atomic operations are ordered with respect to these "sequential" operations.

The intent of a seq_cst operation on an atomic object is not to provide a "fence" that would make all other memory operations sequential.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow