Question

Even for a simple 2-thread communication example, I have difficulty expressing this in the C11 atomics and memory-fence style to obtain proper memory ordering:

shared data:

_Atomic volatile int flag;
volatile int bucket;

producer thread:

while (true) {
   int value = producer_work();
   while (atomic_load_explicit(&flag, memory_order_acquire))
      ; // busy wait
   bucket = value;
   atomic_store_explicit(&flag, 1, memory_order_release);
}

consumer thread:

while (true) {
   while (!atomic_load_explicit(&flag, memory_order_acquire))
      ; // busy wait
   int data = bucket;
   atomic_thread_fence(/* memory_order ??? */);
   atomic_store_explicit(&flag, 0, memory_order_release);
   consumer_work(data);
}

As far as I understand, the above code would properly order store-into-bucket -> flag-store -> flag-load -> load-from-bucket. However, I think there remains a race condition between the load from bucket and the next write of bucket with new data. To force an ordering after the bucket read, I guess I would need an explicit atomic_thread_fence() between the bucket read and the following atomic_store. Unfortunately, there seems to be no memory_order argument that enforces anything on preceding loads, not even memory_order_seq_cst.

A really dirty solution could be to re-assign bucket in the consumer thread with a dummy value, but that contradicts the consumer's read-only role.

In the older C99/GCC world I could use the traditional __sync_synchronize(), which I believe would be strong enough.
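For reference, a sketch of what I mean with the legacy builtin (consume_once_legacy is just an illustrative name; flag and bucket stand in for the shared variables above):

```c
static volatile int flag, bucket;

static int consume_once_legacy(void)
{
    while (!flag)
        ; /* busy wait until producer raises flag */
    int data = bucket;    /* read payload */
    __sync_synchronize(); /* full barrier: orders the load above before the store below */
    flag = 0;             /* hand the bucket back to the producer */
    return data;
}
```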

What would be a nicer C11-style way to synchronize this so-called anti-dependency?

(Of course I am aware that I would be better off avoiding such low-level coding and using available higher-level constructs, but I would like to understand...)

Was it helpful?

Solution

To force an order following the bucket-read, I guess I would need an explicit atomic_thread_fence() between the bucket read and the following atomic_store.

I do not believe the atomic_thread_fence() call is necessary: the flag update has release semantics, preventing any preceding load or store operation from being reordered across it. See the formal definition by Herb Sutter:

A write-release executes after all reads and writes by the same thread that precede it in program order.

This should prevent the read of bucket from being reordered to occur after the flag update, regardless of where the compiler chooses to store data.
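To make that concrete, here is a sketch of the consumer step under this reading — the release store alone provides the ordering, with no separate fence (flag assumed _Atomic; consume_once is an illustrative name):

```c
#include <stdatomic.h>

static _Atomic int flag;
static int bucket;

/* Consumer step: the release store on flag already orders the
   preceding plain load of bucket; no extra fence is needed. */
static int consume_once(void)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ; /* busy wait */
    int data = bucket;  /* plain read of the payload */
    atomic_store_explicit(&flag, 0, memory_order_release); /* bucket read cannot sink below this */
    return data;
}
```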

That brings me to your comment about another answer:

The volatile ensures that there are ld/st operations generated, which can subsequently be ordered with fences. However, data is a local variable, not volatile. The compiler will probably put it in a register, avoiding a store operation. That leaves the load from bucket to be ordered with the subsequent reset of flag.

It would seem that is not an issue if the bucket read cannot be reordered past the flag write-release, so volatile should not be necessary (though it probably doesn't hurt to have it, either). It's also unnecessary because most function calls (in this case, atomic_store_explicit(&flag)) serve as compile-time memory barriers. The compiler would not reorder the read of a global variable past a non-inlined function call because that function could modify the same variable.

I would also agree with @MaximYegorushkin that you could improve your busy-waiting with pause instructions when targeting compatible architectures. GCC and ICC both appear to have the _mm_pause() intrinsic (probably equivalent to __asm__ ("pause")).
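For illustration, a spin-wait sketch using the intrinsic (assuming x86 and the <immintrin.h> header; cpu_relax and wait_for_flag are my names for the wrappers, not standard APIs):

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()  /* x86 PAUSE: be polite while spinning */
#else
#define cpu_relax() ((void)0)    /* fallback: plain spin on other targets */
#endif

static _Atomic int flag;

/* Spin until the producer raises flag, pausing between probes. */
static void wait_for_flag(void)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        cpu_relax();
}
```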

Other tips

I agree with what @MikeStrobel says in his comment.

You don't need atomic_thread_fence() here because your critical sections start with acquire and end with release semantics. Hence, reads within your critical sections cannot be reordered before the acquire, nor writes after the release. And this is why volatile is unnecessary here as well.

In addition, I don't see a reason why a (pthread) spinlock is not used here instead. A spinlock does a similar busy spin for you, but it also uses the pause instruction:

The pause intrinsic is used in spin-wait loops with the processors implementing dynamic execution (especially out-of-order execution). In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain. The execution of the next instruction is delayed for an implementation-specific amount of time. The PAUSE instruction does not modify the architectural state. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin-loop.

Direct answer:

That your store is a memory_order_release operation means that the compiler must make all preceding memory accesses visible before the store of flag, emitting a fence instruction where the target architecture requires one. This is required to ensure that other processors see the final state of the released data before they start interpreting it. So, no, you don't need to add a second fence.


Long answer:

As noted above, what happens is that the compiler transforms your atomic_... calls into combinations of fences and memory accesses; the fundamental abstraction is not the atomic load, it is the memory fence. That is how things work, even though the new C++ abstractions entice you to think differently. And I personally find memory fences much easier to think about than the contrived abstractions in C++.

From a hardware perspective, what you need to ensure is the relative order of your loads and stores, i.e., that the write to bucket completes before flag is written in the producer, and that the load of flag reads a value older than the load of bucket in the consumer.

That said, what you actually need is this:

//producer
while(true) {
    int value = producer_work();
    while (atomic_load_explicit(&flag, memory_order_relaxed))
        ; // busy wait
    atomic_thread_fence(memory_order_acquire);  //ensure that value is not assigned to bucket before the flag is lowered
    bucket = value;
    atomic_thread_fence(memory_order_release);  //ensure bucket is written before flag is
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

//consumer
while(true) {
    while (!atomic_load_explicit(&flag, memory_order_relaxed))
        ; // busy wait
    atomic_thread_fence(memory_order_acquire);  //ensure the value read from bucket is not older than the last value read from flag
    int data = bucket;
    atomic_thread_fence(memory_order_release);  //ensure data is loaded from bucket before the flag is lowered again
    atomic_store_explicit(&flag, 0, memory_order_relaxed);
    consumer_work(data);
}

Note that the labels "producer" and "consumer" are misleading here, because we have two threads playing ping-pong, each becoming producer and consumer in turn; it's just that one thread produces useful values, while the other produces "holes" to write useful values into...

atomic_thread_fence() is all you need, and since it translates directly to the fence instructions underlying the atomic_... abstractions, it is about the fastest approach available.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow