LFENCE does not have acquire semantics; SFENCE does not have release semantics. There's a good reason for that: Having a stand-alone fence instruction with acquire semantics, or release semantics, turns out to be almost completely useless. For an acquire/release to do any good, it must be tied to a memory operation.
For example, consider the common idiom for sending data between two threads:
1. Processor A writes into a buffer.
2. Processor A writes "true" into a flag.
3. Processor B waits until the flag is true.
4. Processor B reads the buffer.
Note that processor A must ensure that its write to the flag is seen after its writes to the buffer. Now suppose we had an "RFENCE" instruction that is a release fence. If we put the instruction immediately after step 1, it does no good, because the write in step 2 is allowed to appear to migrate up over RFENCE and up over step 1.
A similar argument shows that an "AFENCE" instruction that does an acquire is equally useless for ensuring that the read of the flag in step 3 does not appear to migrate downwards across step 4.
Itanium solved the problem elegantly by providing store-with-release and load-with-acquire instructions that tie the fence to a memory operation.
Back to IA-32 and Intel 64: If a program does not use "non-temporal" instructions, then the remaining instructions behave as if every load does an "acquire" and every store does a "release". See Section 8.2.3 (and subsections) of Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A. If there are "non-temporal" stores involved, you have several ways to enforce a fence:
- Use SFENCE
- Use MFENCE - somewhat overkill
- Use a LOCK-prefixed instruction (such as "LOCK INC") to write the flag. LOCK-prefixed instructions implicitly have MFENCEs.
- Use XCHG, which acts as if it has an implicit LOCK prefix, to write the flag.
For example, if the buffer in the earlier idiom is written using non-temporal stores, have processor A issue an SFENCE or MFENCE between steps 1 and 2. Or use XCHG to write the flag.
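As a rough sketch of that advice, here is the idiom with the buffer written by non-temporal stores and an SFENCE before the flag write. The names nt_buffer, nt_ready, nt_produce, nt_consume and run_nt_handoff are made up for illustration. _mm_stream_si32 and _mm_sfence are the standard SSE2/SSE intrinsics. The C++ memory model says nothing about non-temporal stores, so treat this as an x86 sketch, not a portable guarantee. On targets without SSE2 it falls back to ordinary stores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(_M_X64) || defined(__SSE2__)
#include <immintrin.h>   // _mm_stream_si32, _mm_sfence
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

// Hypothetical names, chosen only for this sketch.
static int nt_buffer[4];
static std::atomic<bool> nt_ready{false};

// Processor A: steps 1 and 2, with a fence in between.
void nt_produce() {
    for (int i = 0; i < 4; ++i) {
#if HAVE_NT_STORES
        _mm_stream_si32(&nt_buffer[i], i + 1);   // step 1: non-temporal store
#else
        nt_buffer[i] = i + 1;                    // fallback: ordinary store
#endif
    }
#if HAVE_NT_STORES
    _mm_sfence();   // make the non-temporal stores visible before the flag
#endif
    // Step 2: the release also keeps the compiler from moving the flag write up.
    nt_ready.store(true, std::memory_order_release);
}

// Processor B: steps 3 and 4.
int nt_consume() {
    while (!nt_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int sum = 0;
    for (int i = 0; i < 4; ++i) sum += nt_buffer[i];
    return sum;
}

// Runs one hand-off and returns the sum that processor B observed.
int run_nt_handoff() {
    nt_ready.store(false, std::memory_order_relaxed);
    int result = 0;
    std::thread b([&] { result = nt_consume(); });
    std::thread a(nt_produce);
    a.join();
    b.join();
    return result;
}
```

Calling run_nt_handoff() should return the sum of the four buffer entries. For the XCHG alternative, replacing the SFENCE and store with nt_ready.exchange(true) is a plausible variant, since x86 compilers typically implement an atomic exchange with XCHG, but check the generated code rather than assume it.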
All of the above remarks apply to the hardware. When using a high-level language, be sure that the compiler does not damage the critical ordering of events. The C++11 atomic operations library exists so that you can tell the compiler and hardware what you intend.
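Here is a minimal sketch of the four-step idiom written with C++11 atomics. The names buffer, ready, produce, consume and run_handoff are made up for illustration. The point it shows is that the release is attached to the flag store and the acquire to the flag load, instead of being stand-alone fences.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, chosen only for this sketch.
static int buffer[4];                     // plain (non-atomic) payload
static std::atomic<bool> ready{false};    // the flag from the idiom

// Processor A: steps 1 and 2.
void produce() {
    for (int i = 0; i < 4; ++i) buffer[i] = i + 1;   // step 1: write the buffer
    // Step 2: the release is part of the store itself, so the buffer
    // writes above cannot appear to move below it.
    ready.store(true, std::memory_order_release);
}

// Processor B: steps 3 and 4.
int consume() {
    // Step 3: the acquire is part of the load, so the buffer reads
    // below cannot appear to move above it.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int sum = 0;
    for (int i = 0; i < 4; ++i) sum += buffer[i];    // step 4: read the buffer
    return sum;
}

// Runs one hand-off between two threads and returns what B observed.
int run_handoff() {
    ready.store(false, std::memory_order_relaxed);
    int result = 0;
    std::thread b([&] { result = consume(); });
    std::thread a(produce);
    a.join();
    b.join();
    return result;
}
```

On IA-32 and Intel 64 both the release store and the acquire load normally compile to ordinary MOV instructions, consistent with the remarks above. What the annotations add there is that the compiler may not reorder the buffer accesses across the flag access.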