To answer all 5 questions:
1) A compiler fence (by itself, without a CPU fence) is only useful in two situations:
To enforce memory-ordering constraints between a single thread and an asynchronous interrupt handler bound to that same thread (such as a signal handler).
To enforce memory-ordering constraints between multiple threads when it is guaranteed that every thread will execute on the same CPU core. In other words, either the application runs only on single-core systems, or it takes special measures (through processor affinity) to ensure that every thread sharing the data is bound to the same core.
2) The memory model of the underlying architecture, whether it's strongly- or weakly-ordered, has no bearing on whether a compiler fence is needed in a given situation.
3) Here is pseudo-code, using C++'s std::atomic_signal_fence as the compiler-only fence, which demonstrates the use of a compiler fence, by itself, to sufficiently synchronize memory access between a thread and an async signal handler bound to the same thread:

#include <atomic>
#include <csignal>

static volatile sig_atomic_t is_shared_data_initialized = 0;

void async_signal_handler(int)
{
    if (is_shared_data_initialized)
    {
        // Compiler-only fence: keeps the compiler from hoisting the
        // reads of shared_data above the flag check; no CPU
        // instruction is emitted.
        std::atomic_signal_fence(std::memory_order_acquire);
        // ... use shared_data ...
    }
}

int main()
{
    // initialize shared_data ...
    shared_data->foo = ...;
    shared_data->bar = ...;
    shared_data->baz = ...;
    // shared_data is now fully initialized and ready to use
    // Compiler-only fence: keeps the compiler from sinking the
    // initialization stores below the flag store.
    std::atomic_signal_fence(std::memory_order_release);
    is_shared_data_initialized = 1;
}
Important Note: This example assumes that async_signal_handler is bound to the same thread that initializes shared_data and sets the is_shared_data_initialized flag, which means either the application is single-threaded, or it sets per-thread signal masks accordingly. Otherwise, the compiler fence would be insufficient, and a CPU fence would also be needed.
4) They should be the same. acq_rel and seq_cst should both result in a full (bidirectional) compiler fence, with no fence-related CPU instructions emitted. The concept of "sequential consistency" only comes into play when multiple cores and threads are involved, and atomic_signal_fence only pertains to one thread of execution.
5) No. (Unless, of course, the thread-local data is accessed from an asynchronous signal handler, in which case a compiler fence might be necessary.) Otherwise, fences should never be needed with thread-local data, since the compiler (and CPU) are only allowed to reorder memory accesses in ways that do not change the observable behavior of the program with respect to its sequence points from a single-threaded perspective. One can logically think of thread-local statics in a multi-threaded program as the equivalent of global statics in a single-threaded program: in both cases, the data is only accessible from a single thread, which prevents a data race from occurring.