Question

Modern CPU architectures commonly employ performance optimizations that can result in out-of-order execution. In a single-threaded application memory reordering may also occur, but it is invisible to the programmer: memory appears to be accessed in program order. For SMP, memory barriers come to the rescue; they are used to enforce some sort of memory ordering.

What I'm not sure about is multi-threading on a uniprocessor. Consider the following example: when thread 1 runs, the store to f could take place before the store to x. Say a context switch happens after f is written and right before x is written. Now thread 2 starts to run, exits the loop immediately, and prints 0, which is of course undesirable.

// Both x, f are initialized w/ 0.
// Thread 1
x = 42;
f = 1;

// Thread 2
while (f == 0)
  ;
print x;

Is the scenario described above possible? Or is there a guarantee that physical memory is committed during a thread context switch?

According to this wiki,

When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.

Although it doesn't explicitly mention uniprocessor multi-threaded applications, this case is included.

I'm not sure whether that is correct or complete. Note that this may depend heavily on the hardware (weak vs. strong memory model), so you may want to mention which hardware you have in mind in your answers. Thanks.

PS: Device I/O etc. is not my concern here, and assume a single-core uniprocessor.

Edit: Thanks Nitsan for the reminder; we assume no compiler reordering here (just hardware reordering), and that the loop in thread 2 is not optimized away. Again, the devil is in the details.

Solution

As a C++ question, the answer must be that the program contains a data race, so the behavior is undefined. In practice that means it could print something other than 42.

That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.

[I'll assume that by "uniprocessor" machine, you mean a processor with a single core and a single hardware thread.]

You now say that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.

So essentially thread 1 has two store instructions in the order store-to-x, store-to-f, while thread 2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.

On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.

One reason is that, if instructions processed by a single processor were not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.

The only event that could interfere here is an interrupt (as used to trigger a preemptive context switch). A hypothetical machine that stores the entire state of its current execution pipeline upon an interrupt and restores it upon return from the interrupt could produce a different result, but such a machine is impractical and AFAIK does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon an interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.

And there is no memory-model issue to worry about. The weaker memory models originate from the per-processor buffers and caches that sit between each hardware processor and the main memory or nth-level cache they actually share. A single processor has no similarly partitioned resources and no good reason to have them for multiple (purely software) threads. Again, there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts if there aren't separate processing resources (processors/hardware threads) to keep those resources busy.

OTHER TIPS

A strong memory ordering executes memory-access instructions in exactly the order defined by the program; this is often referred to as "program ordering".

A weaker memory ordering may be employed to allow the processor to reorder memory accesses for better performance; this is often referred to as "processor ordering".

AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, §8.2 Memory Ordering):

Writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH, and string operations.

To illustrate the rule, it gives an example similar to this:

memory location x, y, initialized to 0;

thread 1:

mov [x], 1
mov [y], 1

thread 2:

mov r1, [y]
mov r2, [x]

r1 == 1 and r2 == 0 is not allowed

In your example, thread 1 cannot store f before storing x.

@Eric, in response to your comments:

The fast string store instruction stosd may store the string out of order within its own operation. In a multiprocessor environment, when one processor stores a string str, another processor may observe str[1] being written before str[0], while the logical order presumes str[0] is written before str[1].

But these instructions are not reordered with any other stores, and they must have precise exception handling. When an exception occurs in the middle of a stosd, the implementation may choose to delay it so that all out-of-order sub-stores (which doesn't necessarily mean the whole stosd instruction) commit before the context switch.

Edited to address the claims made as if this were a C++ question:

Even if this is considered in the context of C++, as I understand it, a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.

§1.9.14: Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.

This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.

Allowing that assumption for the sake of argument, note that the loop may never exit anyway, unless you either:

  • give the compiler some reason to believe f may change (e.g., by passing its address to some non-inlineable function that could modify it),
  • mark it volatile, or
  • make it an explicitly atomic type and request acquire semantics

On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there, whatever consistency/coherence protocol applies between cores.

Say both stores were issued and the memory hardware decides to reorder them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but the store to x is deferred until its cache line is fetched. Well, a read from x depends on the same address, so either:

  • the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
  • or the load can peek into the queue and fetch x's value without waiting for the write

Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.


The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.

You would only need a compiler fence. From the Linux kernel docs on Memory Barriers (link):

SMP memory barriers are reduced to compiler barriers on uniprocessor compiled systems because it is assumed that a CPU will appear to be self-consistent, and will order overlapping accesses correctly with respect to itself.

To expand on that, the reason why synchronization is not required at the hardware level is that:

  1. All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and

  2. Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.

This code may well never finish (in thread 2), as the compiler can decide to hoist the whole expression out of the loop (this is similar to using an isRunning flag that is not volatile). That said, you need to worry about two types of reordering here: compiler and CPU; both are free to move the stores around. See http://preshing.com/20120515/memory-reordering-caught-in-the-act for an example. At this point the code you describe is at the mercy of the compiler, compiler flags, and the particular architecture. The wiki quoted is misleading, as it may suggest that internal reordering is not at the mercy of the CPU/compiler, which is not the case.

As far as x86 is concerned, out-of-order stores are made consistent from the viewpoint of the executing code with regard to program flow. In this case, "program flow" is just the flow of instructions that a processor executes, not something constrained to "a program running in a thread". All the instructions necessary for context switching, etc. are considered part of this flow, so consistency is maintained across threads.

A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine state includes the processor registers but not the processor pipeline.

If you assume no compiler reordering, this means that all hardware instructions that are "in flight" have to be completed before a context switch (i.e., an interrupt) takes place; otherwise they would be lost and not stored by the context-switch mechanism. This is independent of hardware reordering.

In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already past the second one, and therefore both instructions must be completed before the context switch begins. If it were not so, since the contents of the pipeline and of the cache are not part of the "context", they would be lost.

In other words, if the interrupt that causes the ctx switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.

From my point of view, the processor fetches instructions one by one. In your case, if "f = 1" was speculatively executed before "x = 42", that means both of these instructions are already in the processor's pipeline. The only possible way to schedule the current thread out is an interrupt. But the processor (at least on x86) will flush the pipeline's instructions before serving the interrupt, so there is no need to worry about reordering on a uniprocessor.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow