Question

Loop hoisting a volatile read

I have read in many places that a volatile variable cannot be hoisted out of a loop or an if, but I cannot find this mentioned anywhere in the C# spec. Is this a hidden feature?

All writes are volatile in C#

Does this mean that all writes have the same properties without the volatile keyword as they do with it? E.g. do ordinary writes in C# have release semantics, and does every write flush the processor's store buffer?

Release semantics

Is this a formal way of saying that the store buffer of a processor is emptied when a volatile write is done?

Acquire semantics

Is this a formal way of saying that it should not load a variable into a register, but fetch it from memory every time?

In this article, Igoro speaks of a "thread cache". I fully understand that this is imaginary, but is he in fact referring to:

  1. the processor's store buffer
  2. loading variables into registers instead of fetching from memory every time
  3. some sort of processor cache (is this L1, L2, etc.?)

Or is this just my imagination?

Delayed writing

I have read in many places that writes can be delayed. Is this because of reordering and the store buffer?

Thread.MemoryBarrier

I understand that a side effect is a "lock or" instruction emitted when the JIT transforms IL to asm, and that this is why a Thread.MemoryBarrier call can solve the delayed write to main memory (in the while loop) in an example like this:

using System.Threading;

static void Main()
{
  bool complete = false; 
  var t = new Thread (() =>
  {
    bool toggle = false;
    while (!complete) toggle = !toggle;
  });
  t.Start();
  Thread.Sleep (1000);
  complete = true;
  t.Join();        // Blocks indefinitely
}

But is this always the case? Will a call to Thread.MemoryBarrier always flush the store buffer and fetch updated values into the processor cache? I understand that the complete variable is not hoisted into a register and is fetched from the processor cache every time, but that the processor cache is updated because of the call to Thread.MemoryBarrier.

Am I on thin ice here, or do I have some sort of understanding of volatile and Thread.MemoryBarrier?


Solution

That's a mouthful...

I'm gonna start with a few of your questions, and update my answer.


Loop hoisting a volatile read

I have read in many places that a volatile variable cannot be hoisted out of a loop or an if, but I cannot find this mentioned anywhere in the C# spec. Is this a hidden feature?

MSDN says "Fields that are declared volatile are not subject to compiler optimizations that assume access by a single thread". This is kind of a broad statement, but it includes hoisting or "lifting" variables out of a loop.
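
As a rough sketch of what that means in practice (the Worker class and field names below are made up for illustration), marking the field volatile keeps every read inside the loop instead of letting the JIT hoist it out:

class Worker
{
    // Without 'volatile', the JIT may assume single-threaded access, read _stop
    // once, and hoist the test out of the loop, turning it into an infinite loop.
    private volatile bool _stop;

    public void Run()
    {
        while (!_stop)
        {
            // do work
        }
    }

    public void Stop() => _stop = true;
}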


All writes are volatile in C#

Does this mean that all writes have the same properties without the volatile keyword as they do with it? E.g. do ordinary writes in C# have release semantics, and does every write flush the processor's store buffer?

Regular writes are not volatile. They do have release semantics, but they don't flush the CPU's write-buffer. At least, not according to the spec.

From Joe Duffy's CLR 2.0 Memory Model

Rule 2: All stores have release semantics, i.e. no load or store may move after one.

I've read a few articles stating that all writes are volatile in C# (like the one you linked to), but this is a common misconception. From the horse's mouth (The C# Memory Model in Theory and Practice, Part 2):

Consequently, the author might say something like, “In the .NET 2.0 memory model, all writes are volatile—even those to non-volatile fields.” (...) This behavior isn’t guaranteed by the ECMA C# spec, and, consequently, might not hold in future versions of the .NET Framework and on future architectures (and, in fact, does not hold in the .NET Framework 4.5 on ARM).


Release semantics

Is this a formal way of saying that the store buffer of a processor is emptied when a volatile write is done?

No, those are two different things. If an instruction has "release semantics", then no store/load instruction will ever be moved below said instruction. The definition says nothing regarding flushing the write-buffer. It only concerns instruction re-ordering.
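
To make the ordering part concrete, here's a small publish/consume sketch (the class and field names are invented for illustration). The release write to _ready guarantees the earlier store to _value can't be moved below it, and the acquire read guarantees the later load of _value can't be moved above it; nothing here promises when the write becomes visible, only in what order:

class Publisher
{
    private int _value;
    private volatile bool _ready;   // volatile write = release, volatile read = acquire

    public void Publish()
    {
        _value = 42;     // ordinary store
        _ready = true;   // release: the store to _value may not move below this write
    }

    public bool TryConsume(out int value)
    {
        if (_ready)      // acquire: the load of _value may not move above this read
        {
            value = _value;
            return true;
        }
        value = 0;
        return false;
    }
}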


Delayed writing

I have read in many places that writes can be delayed. Is this because of reordering and the store buffer?

Yes. Write instructions can be delayed/reordered by either the compiler, the jitter or the CPU itself.


So a volatile write has two properties: release semantics, and store buffer flushing.

Sort of. I prefer to think of it this way:

The C# Specification of the volatile keyword guarantees one property: that reads have acquire-semantics and writes have release-semantics. This is done by emitting the necessary release/acquire fences.

Microsoft's actual C# implementation adds another property: reads will be fresh, and writes will be flushed to memory immediately and made visible to other processors. To accomplish this, the compiler emits an OpCodes.Volatile prefix, and the jitter picks this up and tells the processor not to keep this variable in its registers.

This means that a different C# implementation that doesn't guarantee immediacy will be a perfectly valid implementation.
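
For what it's worth, since .NET 4.5 you can also request those acquire/release semantics explicitly through System.Threading.Volatile instead of marking the field volatile. A sketch (the Flag class below is made up); a read or write of a volatile field behaves roughly like these calls:

using System.Threading;

class Flag
{
    private bool _flag;   // deliberately not declared volatile

    // Roughly what reading a volatile field gives you: a load with acquire semantics.
    public bool IsSet() => Volatile.Read(ref _flag);

    // Roughly what writing a volatile field gives you: a store with release semantics.
    public void Set() => Volatile.Write(ref _flag, true);
}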


Memory Barrier

bool complete = false; 
var t = new Thread (() =>
{
    bool toggle = false;
    while (!complete) toggle = !toggle;
});
t.Start();
Thread.Sleep(1000);
complete = true;
t.Join();     // blocks

But is this always the case? Will a call to Thread.MemoryBarrier always flush the store buffer and fetch updated values into the processor cache?

Here's a tip: try to abstract yourself away from concepts like flushing the store buffer or reading straight from memory. The concept of a memory barrier (or full fence) is in no way related to those two concepts.

A memory barrier has one sole purpose: ensure that store/load instructions below the fence are not moved above the fence, and vice-versa. If C#'s Thread.MemoryBarrier just so happens to flush pending writes, you should think about it as a side-effect, not the main intent.
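
To illustrate that intent in isolation (a contrived sketch, with made-up locals), a full fence is what stops a store from being reordered with a later load, which is something acquire/release semantics alone don't forbid:

using System.Threading;

static void Demo()
{
    int x = 0, y = 0, r1 = 0, r2 = 0;

    var t1 = new Thread(() => { x = 1; Thread.MemoryBarrier(); r1 = y; });
    var t2 = new Thread(() => { y = 1; Thread.MemoryBarrier(); r2 = x; });

    t1.Start(); t2.Start();
    t1.Join();  t2.Join();

    // With both full fences in place, neither store can move below the following
    // load, so observing r1 == 0 && r2 == 0 is impossible.
}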

Now, let's get to the point. The code you posted (which blocks when compiled in Release mode and run without a debugger) could be solved by introducing a full fence anywhere inside the while block. Why? Let's first unroll the loop. Here's what the first few iterations would look like:

if(complete) return;
toggle = !toggle;

if(complete) return;
toggle = !toggle;

if(complete) return;
toggle = !toggle;
...

Because complete is not marked as volatile and there are no fences, the compiler and the CPU are allowed to move the read of the complete field. In fact, the CLR's Memory Model (see rule 6) allows loads to be deleted (!) when coalescing adjacent loads. So, this could happen:

if(complete) return;
toggle = !toggle;
toggle = !toggle;
toggle = !toggle;
...

Notice that this is logically equivalent to hoisting the read out of the loop, and that's exactly what the compiler may do.

By introducing a full-fence either before or after toggle = !toggle, you'd prevent the compiler from moving the reads up and merging them together.

if(complete) return;
toggle = !toggle;
#FENCE
if(complete) return;
toggle = !toggle;
#FENCE
if(complete) return;
toggle = !toggle;
#FENCE
...
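
In actual C#, that amounts to putting Thread.MemoryBarrier() inside the loop body (making complete a volatile field instead of a local would work just as well):

using System.Threading;

static void Main()
{
    bool complete = false;
    var t = new Thread(() =>
    {
        bool toggle = false;
        while (!complete)
        {
            toggle = !toggle;
            Thread.MemoryBarrier();   // full fence: the next read of 'complete'
                                      // cannot be merged with the previous one
        }
    });
    t.Start();
    Thread.Sleep(1000);
    complete = true;
    t.Join();    // now returns
}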

In conclusion, the key to solving these issues is ensuring that the instructions will be executed in the correct order. It has nothing to do with how long it takes for other processors to see one processor's writes.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow