Question

When an out-of-order processor encounters something like

LOAD R1, 0x1337
LOAD R2, $R1
LOAD R3, 0x42

Assuming that all accesses result in cache misses, can the processor ask the memory controller for the contents of 0x42 before it asks for the contents of $R1, or even of 0x1337? If so, and assuming that the access to $R1 raises an exception (e.g., a segmentation fault), can we say that 0x42 was loaded speculatively?

And by the way, when a load-store unit sends a request to the memory controller, can it send a second request before receiving the answer to the previous one?

My question doesn't target any architecture in particular; answers about any mainstream architecture are welcome.


Solution

The answer to your question depends on the memory-ordering model of your CPU, which is not the same thing as the CPU allowing out-of-order execution. If the CPU implements total store ordering (TSO, e.g. x86 or SPARC), then the architecturally visible answer is that 0x42 will not appear to be loaded before 0x1337.

If the CPU implements a relaxed memory model (e.g. IA-64, PowerPC, Alpha), then in the absence of a memory-fence instruction all bets are off as to which will be accessed first. This is of little relevance unless you are doing I/O or dealing with multi-threaded code.

You should also note that some CPUs (e.g. Itanium) have relaxed memory models (so reads may complete out of order) but do NOT have any out-of-order execution logic: they expect the compiler to order instructions, and to schedule speculative loads, in an optimal way rather than spend silicon on OoO execution.

OTHER TIPS

This would seem to be a logical conclusion for superscalar CPUs with multiple load-store units, too. Multi-channel memory controllers are pretty common these days.

In the case of out-of-order instruction execution, an enormous amount of logic is expended in determining whether instructions have dependencies on others in the stream - not just register dependencies but also operations on memory. There's also an enormous amount of logic for handling exceptions: the CPU needs to complete all instructions in the stream up to the fault (or alternatively, offload some parts of this onto the operating system).

In terms of the programming model seen by most applications, the effects are never apparent. As far as memory is concerned, loads will not always happen in the sequence you might expect - but that is already the case whenever caches are in use.

Clearly, in circumstances where the order of loads and stores does matter - for instance, when accessing device registers - out-of-order effects must be suppressed. The POWER architecture has the wonderful EIEIO instruction for this purpose.

Some members of the ARM Cortex-A family offer OoO execution. I suspect that, given the power constraints of these devices and the apparent lack of instructions for forcing ordering, load-stores always complete in order.

A compliant SPARC processor must implement TSO but may also implement RMO and PSO. You need to know what mode your OS is running in unless you happen to know your specific hardware platform has not implemented RMO and PSO.

Related for x86: Why flush the pipeline for Memory Order Violation caused by other logical processors?. The observable result will obey x86 ordering rules, but microarchitecturally yes it can load early. (And of course that's from cache; HW prefetch is different).

OoO exec CPUs truly do reorder load execution if the address isn't ready for one load. Or if it misses in cache, then later loads can run before data arrives for this one. But on x86, to maintain correctness wrt. the strong memory model (program order + a store buffer with store forwarding), the core checks if the eventual result was legal according to the ISA's on-paper memory-model guarantees. (i.e. that the cache line loaded from earlier is still valid and thus still contains the data we're now allowed to load). If not, nuke the in-flight instructions that depended on this possibly-unsafe speculation and roll back to a known safe state.

So modern x86 gets the perf of relaxed load ordering (most of the time) while still maintaining the memory-model rules where every load is effectively an acquire load. But at the cost of pipeline nukes if you do something the pipeline doesn't like, e.g. false sharing (which is already bad enough).

Other CPUs with a strong memory model (e.g. SPARC TSO) might not be this aggressive. Weak memory models simply allow later loads to complete early.

Of course this is reading from cache; demand-load requests are seen by the / a memory controller only on cache miss. But HW prefetchers can access memory asynchronously from the CPU; that's how they get data into cache ahead of when the CPU runs an instruction that loads it, ideally avoiding a cache miss at all.


And yes, the memory subsystem is pipelined, like 12 to 16 outstanding requests per core in Skylake. (12 LFBs for L1<->L2, and IIRC 16 superqueue entries in the L2.)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow