To answer all 5 questions:
1) A compiler fence (by itself, without a CPU fence) is only useful in two situations:
To enforce memory-ordering constraints between a single thread and an asynchronous interrupt handler bound to that same thread (such as a signal handler).
To enforce memory-ordering constraints between multiple threads when it is guaranteed that every thread will execute on the same CPU core. In other words, either the application runs only on single-core systems, or it takes special measures (through processor affinity) to ensure that every thread sharing the data is bound to the same core.
2) The memory model of the underlying architecture, whether it's strongly- or weakly-ordered, has no bearing on whether a compiler fence is needed in a given situation.
3) Here is pseudo-code, using C++'s std::atomic_signal_fence as the compiler-only fence, which demonstrates the use of a compiler fence, by itself, to sufficiently synchronize memory access between a thread and an async signal handler bound to the same thread:

#include <atomic>
#include <csignal>

static volatile sig_atomic_t is_shared_data_initialized = 0;

void async_signal_handler(int)
{
    if (is_shared_data_initialized)
    {
        // Compiler-only fence: keeps the compiler from hoisting the
        // reads of shared_data above the flag check; no CPU
        // instruction is emitted.
        std::atomic_signal_fence(std::memory_order_acquire);
        // ... use shared_data ...
    }
}

int main()
{
    // initialize shared_data ...
    shared_data->foo = ...;
    shared_data->bar = ...;
    shared_data->baz = ...;
    // shared_data is now fully initialized and ready to use
    // Compiler-only fence: keeps the compiler from sinking the
    // initialization stores below the flag store.
    std::atomic_signal_fence(std::memory_order_release);
    is_shared_data_initialized = 1;
}
Important Note: This example assumes that async_signal_handler is bound to the same thread that initializes shared_data and sets the is_shared_data_initialized flag, which means either the application is single-threaded, or it sets per-thread signal masks accordingly. Otherwise, the compiler fence would be insufficient, and a CPU fence would also be needed.
4) They should be the same. acq_rel and seq_cst should both result in a full (bidirectional) compiler fence, with no fence-related CPU instructions emitted. The concept of "sequential consistency" only comes into play when multiple cores and threads are involved, and atomic_signal_fence only pertains to one thread of execution.
5) No. (Unless, of course, the thread-local data is accessed from an asynchronous signal handler, in which case a compiler fence might be necessary.) Otherwise, fences should never be needed with thread-local data, since the compiler (and CPU) are only allowed to reorder memory accesses in ways that do not change the observable behavior of the program with respect to its sequence points from a single-threaded perspective. One can logically think of thread-local statics in a multi-threaded program as the equivalent of global statics in a single-threaded program: in both cases, the data is only accessible from a single thread, which prevents a data race from occurring.