Question

I have a question about the following code sample (m_value isn't volatile, and every thread runs on a separate processor):

void Foo() // executed by thread #1, BEFORE Bar() is executed
{
   Interlocked.Exchange(ref m_value, 1);
}

bool Bar() // executed by thread #2, AFTER Foo() is executed
{
   return m_value == 1;
}

Does using Interlocked.Exchange in Foo() guarantee that when Bar() is executed, I'll see the value "1"? (Even if the value already exists in a register or cache line?) Or do I need to place a memory barrier before reading the value of m_value?

Also (unrelated to the original question), is it legal to declare a volatile member and pass it by reference to InterlockedXX methods? (The compiler warns about passing volatiles by reference, so should I ignore the warning in such a case?)

Please note, I'm not looking for "better ways to do things", so please don't post answers that suggest completely alternate approaches ("use a lock instead", etc.); this question comes out of pure interest.


Solution

The usual pattern for memory barrier usage matches what you would put in the implementation of a critical section, but split into pairs for the producer and consumer. As an example, your critical section implementation would typically be of the form:

while (!pShared->lock.testAndSet_Acquire())
    ; // (this loop should include all the normal critical-section stuff:
      //  spinning, wasting time, pause() instructions, and a last-resort
      //  fallback to blocking on a resource until the lock is available.)

// Access to shared memory.

pShared->foo = 1;
v = pShared->goo;

pShared->lock.clear_Release();

The acquire memory barrier above makes sure that any loads (of pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.

The release memory barrier ensures that the load from goo into the (say, local) variable v is complete before the lock word protecting the shared memory is cleared.
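For reference, here is a rough C# analogue of that shape - a sketch only, with hypothetical names (SharedState, lockWord): Interlocked.CompareExchange and Interlocked.Exchange are full fences in .NET, so they give you at least the acquire and release ordering described above.

class SharedState
{
    public int lockWord; // 0 = free, 1 = held (hypothetical field)
    public int foo;
    public int goo;
}

void WithLock(SharedState pShared)
{
    // Acquire: loop until we swap the lock word from 0 to 1.
    // CompareExchange is a full fence, so later loads can't float above it.
    while (Interlocked.CompareExchange(ref pShared.lockWord, 1, 0) != 0)
    {
        // spin / pause / back off, as the pseudocode's comment suggests
    }

    // Access to shared memory.
    pShared.foo = 1;
    int v = pShared.goo;

    // Release: Exchange is a full fence, so the goo load completes first.
    Interlocked.Exchange(ref pShared.lockWord, 0);
}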

You have a similar pattern in the typical producer/consumer atomic-flag scenario (it is difficult to tell from your sample whether that is what you are doing, but it should illustrate the idea).

Suppose your producer used an atomic variable to indicate that some other state is ready to use. You'll want something like this:

pShared->goo = 14;

pShared->atomic.setBit_Release();

Without a "write" barrier here in the producer you have no guarantee that the hardware isn't going to get to the atomic store before the goo store has made it through the cpu store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).

In the consumer

if (pShared->atomic.compareAndSwap_Acquire(1, 1))
{
    v = pShared->goo;
}

Without a "read" barrier here you won't know that the hardware hasn't gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only "atomic" with respect to itself, not other memory.

Now, the remaining thing that has to be mentioned is that the barrier constructs are highly non-portable. Your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods, and these are the sorts of ways you would use them. Depending on the platform you are using (e.g., ia32), these may very well be exactly what you would get without the _acquire() or _release() suffixes. Platforms where this matters are ia64 (effectively dead except on HP, where it's still twitching slightly) and PowerPC. ia64 had .acq and .rel instruction modifiers on most load and store instructions (including the atomic ones like cmpxchg). PowerPC has separate instructions for this (isync and lwsync give you the read and write barriers, respectively).

Now, having said all this: do you really have a good reason for going down this path? Doing all this correctly can be very difficult. Be prepared for a lot of self-doubt and insecurity in code reviews, and make sure you have a lot of high-concurrency testing with all sorts of random timing scenarios. Use a critical section unless you have a very, very good reason to avoid it, and don't write that critical section yourself.

OTHER TIPS

Memory barriers don't particularly help you here. They specify an ordering between memory operations, and in this case each thread has only one memory operation, so there is nothing to order. One typical scenario is writing non-atomically to the fields of a structure, issuing a memory barrier, and then publishing the structure's address to other threads. The barrier guarantees that the writes to the structure's members are seen by all CPUs before they get its address.
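That publish-with-barrier pattern, sketched in C# (Payload and s_shared are hypothetical names; note that a correct reader would also need a volatile or fenced read, which is omitted here):

class Payload
{
    public int A;
    public int B;
}

static Payload s_shared;

static void Publish()
{
    var data = new Payload();
    data.A = 1;                 // non-atomic writes to the members...
    data.B = 2;
    Thread.MemoryBarrier();     // ...are ordered before...
    s_shared = data;            // ...the publish of the address.
}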

What you really need are atomic operations, i.e., the InterlockedXXX functions, or volatile variables in C#. If the read in Bar were atomic, you could guarantee that neither the compiler nor the CPU does any optimization that prevents it from reading either the value before the write in Foo or the value after the write in Foo, depending on which gets executed first. Since you are saying that you "know" Foo's write happens before Bar's read, Bar would always return true.

Without the read in Bar being atomic, it could be reading a partially updated value (i.e., garbage) or a cached value (either from the compiler or from the CPU), both of which may prevent Bar from returning true when it should.

Most modern CPUs guarantee that word-aligned reads are atomic, so the real trick is that you have to tell the compiler that the read is atomic.
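In C#, the usual way to tell the compiler is the volatile keyword - a sketch (not the asker's actual declaration, which is non-volatile):

private volatile int m_value;

public bool Bar()
{
    // A volatile read: never cached in a register, acquire semantics.
    return m_value == 1;
}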

I'm not completely sure, but I think Interlocked.Exchange will use the InterlockedExchange function of the Windows API, which provides a full memory barrier anyway.

This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order.

The interlocked exchange operations guarantee a memory barrier.

The following synchronization functions use the appropriate barriers to ensure memory ordering:

  • Functions that enter or leave critical sections

  • Functions that signal synchronization objects

  • Wait functions

  • Interlocked functions

(Source: link)
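Applying that to the original pair, here is a hedged sketch in which both sides go through Interlocked and therefore both get fences; CompareExchange with value == comparand is effectively a fenced read that never modifies the field:

void Foo() // writer: Exchange is a full fence
{
    Interlocked.Exchange(ref m_value, 1);
}

bool Bar() // reader: fenced read via a no-op compare-and-swap
{
    return Interlocked.CompareExchange(ref m_value, 1, 1) == 1;
}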

But you are out of luck with register variables: if m_value is cached in a register in Bar, you won't see the change to m_value. Because of this, you should declare shared variables volatile.

If m_value is not marked as volatile, then there is no reason to think that the value read in Bar is fenced. Compiler optimizations, caching, or other factors could reorder the reads and writes. Interlocked exchange is only helpful when it is used in an ecosystem of properly fenced memory references. This is the whole point of marking a field volatile. The .NET memory model is not as straightforward as some might expect.

Interlocked.Exchange() should guarantee that the value is flushed to all CPUs properly - it provides its own memory barrier.

I'm surprised that the compiler is complaining about passing a volatile into Interlocked.Exchange() - the fact that you're using Interlocked.Exchange() should almost mandate a volatile variable.
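For the record, the warning in question is CS0420 ("a reference to a volatile field will not be treated as volatile"), and calls to the Interlocked methods are the documented case where it is safe to suppress - a sketch:

private volatile int m_value;

public void Foo()
{
#pragma warning disable 420 // ref to a volatile field is safe with Interlocked
    Interlocked.Exchange(ref m_value, 1);
#pragma warning restore 420
}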

The problem you might see is that if the compiler does some heavy optimization of Bar() and realizes that nothing changes the value of m_value, it can optimize away your check. That's what the volatile keyword would do - it would hint to the compiler that the variable may be changed outside of the optimizer's view.

If you don't tell the compiler or runtime that m_value should not be read ahead of Bar(), it can and may cache the value of m_value ahead of Bar() and simply use the cached value. If you want to ensure that it sees the "latest" version of m_value, either shove in a Thread.MemoryBarrier() or use Thread.VolatileRead(ref m_value). The latter is less expensive than a full memory barrier.
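Both options, sketched against the asker's field:

bool BarWithBarrier()
{
    Thread.MemoryBarrier();     // full fence before the read
    return m_value == 1;
}

bool BarWithVolatileRead()
{
    // Thread.VolatileRead returns a freshly read, fenced value.
    return Thread.VolatileRead(ref m_value) == 1;
}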

Ideally you could shove in a ReadBarrier, but the CLR doesn't seem to support that directly.

EDIT: Another way to think about it is that there are really two kinds of memory barriers: compiler memory barriers, which tell the compiler how to sequence reads and writes, and CPU memory barriers, which tell the CPU how to sequence reads and writes. The Interlocked functions use CPU memory barriers. Even if the compiler treated them as compiler memory barriers, it still wouldn't matter: in this specific case, Bar() could have been compiled separately, without knowledge of the other uses of m_value that would require a compiler memory barrier.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow