Java 8 Unsafe: xxxFence() instructions

Question 1

Summary

CPU cores have special memory ordering buffers to assist them with out-of-order execution. These can be (and typically are) separate for loading and storing: LOBs for load-order buffers and SOBs for store-order buffers.

The fencing operations chosen for the Unsafe API are were selected based on the following assumption: underlying processors will have separate load-order buffers (for reordering loads), store-order buffers (for reordering stores).

Therefore, based on this assumption, from a software point of view, you can request one of three things from the CPU:

Empty the LOBs (loadFence): means that no other instructions will start executing on this core, until ALL entries the LOBs have been processed. In x86 this is an LFENCE.
Empty the SOBs (storeFence): means that no other instructions will start executing on this core, until ALL entries in the SOBs have been processed. In x86 this is an SFENCE.
Empty both LOBs and SOBs(fullFence): means both of the above. In x86 this is an MFENCE.

In reality, each specific processor architecture provides different memory ordering guarantees, which may be more stringent, or more flexible than the above. For example, SPARC architecture can reorder load-store and store-load sequences, whereas x86 will not do that. Furthermore, architectures exist where LOBs and SOBs cannot be controlled individually (i.e. only full-fence is possible). In both cases however:

when the architecture is more flexible, the API simply does not provide access to the "laxer" sequencing combinations as a matter of choise
when the architecture is more stringent, the API simply implements the more stringent sequencing guarantee in all cases (e.g. all 3 calls actually and up being implemented as a full fence)

The reason for the particular API choices is explained in the JEP as per the answer assylias provides which is 100% on-the-spot. If you know about memory ordering and cache coherence, assylias' answer should suffice. I think the fact that they match the standardized instruction in the C++ API was a major factor (simplifies JVM implementation a lot): http://en.cppreference.com/w/cpp/atomic/memory_order In all likelihood, actual implementation will call into the respective C++ API instead of using some special instruction.

Below I have a detailed explanation with x86-based examples, which will provide all the context necessary to understand these things. In fact, the demarcated (section below answers another question: "Can you provide basic examples of how memory fences work to control cache coherence in the x86 architecture?"

The reason for this is that I myself (coming from a software developer and not hardware designer) had trouble understanding what memory reordering is, until I learned specific examples of how cache coherence actually works in x86. This provides invaluable context for discussing memory fences in general (for other architectures as well). At the end I discuss SPARC a bit using the knowledge gained from the x86 examples

The reference [1] is an even more detailed explanation and has a separate section for discussing each of: x86, SPARC, ARM and PowerPC, so it is an excellent read if you are interested in more details.

x86 architecture example

x86 provides 3 types of fencing instructions: LFENCE (load fence), SFENCE (store fence) and MFENCE (load-store fence), so it maps 100% to the Java API.

This is because x86 has separate load-order buffers (LOBs) and store-order buffers (SOBs), so indeed LFENCE/SFENCE instructions apply to the respective buffer, whereas MFENCE applies to both.

SOBs are used to store an outgoing value (from processor to cache system) while the cache coherence protocol works to acquire permission to write to the cache line. LOBs are used to store invalidation requests so that invalidation can execute asynchronously (reduces stalling on the receiving side in the hope that the code executing there will not actually need that value).

Out-of-order stores and SFENCE

Suppose you have a dual processor system with its two CPUs, 0 and 1, executing the routines below. Consider the case where the cache line holding failure is initially owned by CPU 1, whereas the cache line holding shutdown is initially owned by CPU 0.

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  shutdown = 1; // can execute immediately as it is owned be CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  if (failure) { ...}
}

In the absence of a store fence, CPU 0 may signal a shutdown due to failure, but CPU 1 will exit the loop and NOT got into the failure-handling if block.

This is because CPU0 will write the value 1 for failure to a store-order buffer, also sending out a cache coherence message to acquire exclusive access to the cache line. It will then proceed to the next instruction (while waiting for exclusive access) and update the shutdown flag immediately (this cache line is owned exclusively by CPU0 already so no need to negotiate with other cores). Finally, when it later receives an invalidation confirmation message from CPU1 (regarding failure) it will proceed to process the SOB for failure and write the value to the cache (but the order is by now reversed).

Inserting a storeFence() will fix things:

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  SFENCE // next instruction will execute after all SOBs are processed
  shutdown = 1; // can execute immediately as it is owned be CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  if (failure) { ...}
}

A final aspect that deserves mention is that x86 has store-forwarding: when a CPU writes a value which gets stuck in an SOB (due to cache coherence), it may subsequently attempt to execute a load instruction for the same address BEFORE the SOB is processed and delivered to the cache. CPUs will therefore consult the SOBs PRIOR to accessing the cache, so the value retrieved in this case is the last-written value from the SOB. this means that stores from THIS core can never be reordered with subsequent loads from THIS core no matter what.

Out-of-order loads and LFENCE

Now, assume you have the store fence in place and are happy that shutdown cannot overtake failure on its way to CPU 1, and focus on the other side. Even in the presence of the store fence, there are scenarios where the wrong thing happens. Consider the case where failure is in both caches (shared) whereas shutdown is only present in and owned exclusively by the cache of CPU0. Bad things can happen as follows:

CPU0 writes 1 to failure; It also sends a message to CPU1 to invalidate its copy of the shared cache line as part of the cache coherence protocol.
CPU0 executes the SFENCE and stalls, waiting for the SOB used for failure to commit.
CPU1 checks shutdown due to the while loop and (realizing it is missing the value) sends a cache coherence message to read the value.
CPU1 receives the message from CPU0 in step 1 to invalidate failure, sending an immediate acknowledgement for it. NOTE: this is implemented using the invalidation queue, so in fact it simply enters a note (allocates an entry in its LOB) to later do the invalidation, but does not actually perform it before sending out the acknowledgement.
CPU0 receives the acknowledgement for failure and proceeds past the SFENCE to the next instruction
CPU0 writes 1 to shutdown without using a SOB, because it already owns the cache line exclusively. no extra message for invalidation is sent as the cache line is exclusive to CPU0
CPU1 receives the shutdown value and commits it to its local cache, proceeding to the next line.
CPU1 checks the failure value for the if statement, but since the invalidate queue (LOB note) is not yet processed, it uses the value 0 from its local cache (does not enter if block).
CPU1 processes the invalidate queue and update failure to 1, but it is already too late...

What we refer to as load order buffers, is actaully the queueing of invalidation requests, and the above can be fixed with:

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  SFENCE // next instruction will execute after all SOBs are processed
  shutdown = 1; // can execute immediately as it is owned be CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  LFENCE // next instruction will execute after all LOBs are processed
  if (failure) { ...}
}

Your question on x86

Now that you know what SOBs/LOBs do, think about the combinations you mentioned:

loadFence() becomes load_loadstoreFence();

No, a load fence waits for LOBs to be processed, essentially emptying the invalidation queue. This means that all subsequent loads will see up-to-date data (no re-ordering), as they will be fetched from the cache sub-system (which is coherent). Stores CANNNOT be reordered with subsequent loads, because they do not go through the LOB. (and furthermore store forwarding takes care of locally-modified cachce lines) From the perspective of THIS particular core (the one executing the load fence), a store that follows the load fence will execute AFTER all registers have the data loaded. There is no way around it.

load_storeFence() becomes ???

There is no need for a load_storeFence as it does not make sense. To store something you must calculate it using input. To fetch input you must execute loads. The stores will occur using the data fetched from loads. If you want to make sure you see up-to-date values from all OTHER processors when loading use a loadFence. For loads after the fence store-forwarding takes care of consistent ordering.

All other cases are similar.

SPARC

SPARC is even more flexible and can reorder stores with subsequent loads (and loads with subsequent stores). I was not as familiar with SPARC, so my GUESS was that there is no store-forwarding (SOBs are not consulted when reloading an address) so "dirty reads" are possible. In fact I was wrong: I found the SPARC architecture in [3] and the reality is that store-forwarding is threaded. From section 5.3.4:

All loads check the store buffer (same thread only) for read after write (RAW) hazards. A full RAW occurs when the dword address of the load matches that of a store in the STB and all bytes of the load are valid in the store buffer. A partial RAW occurs when the dword addresses match, but all bytes are not valid in the store buffer. (Ex., a ST (word store) followed by an LDX (dword load) to the same address results in a partial RAW, because the full dword is not in the store buffer entry.)

So, different threads consult different store-order buffers hence the possibility for dirty reads after stores.

References

[1] Memory Barriers: a Hardware View for Software Hackers, Linux Technology Center, IBM Beaverton http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

[2] Intel® 64 and IA-32 ArchitecturesSoftware Developer’s Manual, Volume 3A http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[3] OpenSPARC T2 Core Microarchitecture Specification http://www.oracle.com/technetwork/systems/opensparc/t2-06-opensparct2-core-microarch-1537749.html

Question 2

A good source of information is the JEP 171 itself.

Rationale:

The three methods provide the three different kinds of memory fences that some compilers and processors need to ensure that particular accesses (loads and stores) do not become reordered.

Implementation (extract):

for the C++ runtime versions (in prims/unsafe.cpp), implementing via the existing OrderAccess methods:

    loadFence:  { OrderAccess::acquire(); }
    storeFence: { OrderAccess::release(); }
    fullFence:  { OrderAccess::fence(); }

In other words the new methods closely relate to how memory fences are implemented at the JVM and CPU levels. They also match the memory barrier instructions available in C++, the language in which hotspot is implemented.

A finer grained approach would probably have been feasible but the benefits are not obvious.

For example if you look at the table of cpu instructions in the JSR 133 Cookbook, you will see that LoadStore and LoadLoad map to the same instructions on most architectures, i.e. both are effectively Load_LoadStore instructions. So having a single Load_LoadStore (loadFence) instruction at the JVM level seems like a reasonable design decision.

Question 3

The doc for storeFence() is wrong. See https://bugs.openjdk.java.net/browse/JDK-8038978

loadFence() is LoadLoad plus LoadStore, so useful often called acquire fence.

storeFence() is StoreStore plus LoadStore, so useful often called release fence.

LoadLoad LoadStore StoreStore are cheap fences (nop on x86 or Sparc, cheap on Power, maybe expensive on ARM).

IA64 has different instructions for acquire and release semantics.

fullFence() is LoadLoad LoadStore StoreStore plus StoreLoad.

StordLoad fence is expensive (on almost all CPU), almost as expensive as full fence.

That justifies the API design.

Question 4

It looks that storeFence() should be mapped to "loadStore_storeFence" according to the comments from source code:

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/misc/Unsafe.java#L3422

/**
 * Ensures that loads and stores before the fence will not be reordered with
 * stores after the fence; a "StoreStore plus LoadStore barrier".
 * ...
 * /