Question

If I write something like this:

std::atomic<bool> *p = new std::atomic<bool>(false); // At the beginning of the program

//...

void thread1()
{
    while (!(*p)) {
        // Do something
    }
}

//...

void thread2()
{
    //...
    *p = true;
    //...
}

thread1 and thread2 will run simultaneously. The value of p never changes after it is initialized. Is the dereference operation safe in this case? I want to avoid using atomic pointers for performance reasons.


Solution 4

It depends on what surrounds your two accesses. If the master writes some data just before setting the boolean, the slave needs a memory barrier to make sure it does not read said data before it sees the boolean.

Maybe for now your thread is just waiting on this boolean to exit, but if one day you decide the master should, for instance, pass a termination status to the slaves, your code might break.
If you come back 6 months later and modify this piece of code, are you certain you will remember that the area beyond your slave loop is a no-shared-read zone and the one before your master's boolean update a no-shared-write zone?

At any rate, your boolean would need to be volatile, or else the compiler might optimize it away. Or worse, your coworker's compiler might, while you are off laying another piece of unreliable code.

It is a well-known fact that volatile variables are usually not good enough for thread synchronization, because they do not provide memory barriers, as this simple example shows:

master :

// previous value of x = 123
x = 42;
*p = true;

bus logic on slave processor:

write *p = true

slave:

while (!*p) { /* whatever */ }
the_answer = x; // <-- boom ! the_answer = 123

bus logic on slave's processor:

write x = 42 // too late...

(symmetric problem if the master's bus writes are scheduled out of order)
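For completeness, here is a minimal sketch (my own illustration, not code from the question) of how that handshake becomes safe with std::atomic and explicit release/acquire ordering: the release store on the flag forces the write to x to become visible before the flag does.

#include <atomic>
#include <cassert>
#include <thread>

int x = 123;                    // plain shared data
std::atomic<bool> flag(false);  // synchronization flag

void master()
{
    x = 42;                                       // 1. write the data
    flag.store(true, std::memory_order_release);  // 2. publish it: no earlier write may be reordered past this store
}

void slave()
{
    while (!flag.load(std::memory_order_acquire)) // pairs with the release store
        ;                                         // spin until the flag is seen
    assert(x == 42);                              // the acquire load guarantees we see everything written before the release
}

int main()
{
    std::thread t1(master), t2(slave);
    t1.join();
    t2.join();
}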

Of course, chances are you will never witness such a rare occurrence on your particular desktop computer, just like you could, by sheer luck, run a program that vandalizes its own memory without ever crashing.

Nevertheless, software written with such leaky synchronization is a ticking time bomb. Compile and run it long enough on a variety of bus architectures and one day... Ka-boom!


As a matter of fact, C++11 is hurting multiprocessor programming a lot by letting you create tasks as if there were nothing to it, while at the same time offering nothing but crappy atomics, mutexes and condition variables to handle the synchronization (and the bloody awkward futures, of course).

The simplest and most efficient way to synchronize tasks (especially worker threads) is to have them process messages on a queue. That is how drivers and real-time software work, and so should any multiprocessor application unless some extraordinary performance requirements show up.
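To make that concrete, here is a minimal sketch of such a queue in C++11 (my own illustration; the name MsgQueue and its interface are made up, not a standard or library type):

#include <condition_variable>
#include <deque>
#include <mutex>

// A minimal blocking message queue: producers push, workers block in pop()
// until a message arrives, so nobody spins on a flag.
template <typename T>
class MsgQueue {
public:
    void push(T msg)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(msg));
        }
        cv_.notify_one();   // wake one waiting worker
    }

    T pop()   // blocks until a message is available
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T msg = std::move(queue_.front());
        queue_.pop_front();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<T> queue_;
};

A worker then simply loops on pop() and treats a designated message (a "quit" value, or one carrying a termination status) as its signal to stop, which also answers the termination-status concern raised above.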

Forcing programmers to control multitasking with glorified flags is stupid. You need to understand very clearly how the hardware works to play around with atomic counters.
The pedantic clique of C++ is again forcing every man and his dog to become experts in yet another field just to avoid writing crappy, unreliable code.

And as usual, you will have the gurus spouting their "good practices" with an indulgent smile, while people burn megajoules of CPU power in stupid spinning loops inside broken homebrewed queues in the belief that "no-wait" synchronization is the alpha and omega of efficiency.

And this performance obsession is a non-issue. "Blocking" calls consume nothing but crumbs of the available computational power, and there are a number of other factors that hurt performance by a couple of orders of magnitude more than operating system synchronization primitives (the absence of a standard way to pin tasks to a given processor, for a start).

Consider your thread1 slave. Accessing an atomic bool will throw a handful of sand into the bus and cache cogwheels, slowing down this particular access by a factor of about 20. That is a few dozen wasted cycles. Unless your slave is just twiddling its virtual thumbs inside the loop, this handful of cycles will be dwarfed by the thousands or millions a single loop iteration will last. Also, what will happen if your slave is done working while its brother slaves are not? Will it spin uselessly on this flag and waste CPU, or block on some mutex?
It is exactly to address these problems that message queues were invented.

A proper OS call like a message queue read would maybe consume a couple of hundred cycles. So what?
If your slave thread is just there to increment 3 counters, then it is your design that is at fault. You don't launch a thread to move a couple of matchsticks, just like you don't allocate your memory byte by byte, even in a high-level language like C++.

Provided you don't use threads to munch breadcrumbs, you should rely on simple and proven mechanisms like waiting queues, semaphores or events (picking the POSIX or Microsoft ones for lack of a portable solution), and you will not notice any impact on performance whatsoever.

EDIT: more on system call overhead

Basically, a call to a waiting queue will cost a few microseconds.

Assuming your average worker crunches numbers for 10 to 100 ms, the system call overhead will be indiscernible from background noise, and the thread termination responsiveness will stay within acceptable limits (< 0.1 s).

I recently implemented a Mandelbrot set explorer as a test case for parallel processing. It is in no way representative of all parallel processing cases, but still I noticed a few interesting things.

On my Intel i3 (2 cores / 4 logical CPUs @ 3.1 GHz), using one worker per CPU, I measured the gain factor (i.e. the ratio of execution times using 1 core over 4 cores) of parallelizing pure computation (i.e. with no data dependency whatsoever between workers).

  • localizing the threads on one core each (instead of letting the OS scheduler move them from one core to another) boosted the ratio from 3.2 to 3.5, out of a theoretical max of 4 (a Linux-specific sketch of this pinning follows the list)

  • besides pinning threads to distinct cores, the most notable improvements were due to optimizations of the algorithm itself (more efficient computations and better load balancing).

  • the cost of about 1000 C++11 mutex locks used to let the 4 workers draw from a common queue amounted to 7 ms, i.e. 7 µs per call.
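There is no standard C++11 way to do that pinning; on Linux it can be done through the std::thread native handle, roughly like this (a platform-specific sketch, error checking omitted; pthread_setaffinity_np is a GNU extension):

#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>

// Pin a std::thread to a given core (Linux-specific sketch).
void pin_to_core(std::thread& t, int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

Calling something like pin_to_core(worker, i) right after creating each worker reproduces the "one worker per core" setup measured above.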

I can hardly imagine a high performance design doing more than 1000 synchronizations per second (or else your time might be better spent improving the design), so basically your "blocking" calls would cost well under 1% of the power available on a rather low-cost PC.
The choice is yours, but I am not sure implementing raw atomic objects right from the start will be the decisive factor for performance.

I would advise starting with simple queues and doing some benchmarking. You can use the pthread POSIX interface, or take for instance this pretty good sample as a base for a conversion to C++11.

You can then debug your program and evaluate the performance of your algorithms in a synchronization-bug-free environment.

If the queues prove to be the real CPU hogs and your algorithm cannot be refactored to avoid excessive synchronization calls, it should be relatively easy to switch to whatever spinlocks you assume to be more efficient (a sketch of one follows), especially if your computations have been streamlined and data dependencies sorted out beforehand.
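If you do go that route, a C++11 spinlock is only a few lines; this is my own sketch (not a guarantee of better performance) built on std::atomic_flag with acquire/release ordering:

#include <atomic>

// Minimal spinlock: busy-waits instead of blocking, so reserve it for very
// short critical sections, and only after profiling shows the mutex is the hog.
class Spinlock {
public:
    void lock()
    {
        while (flag_.test_and_set(std::memory_order_acquire))
            ;   // spin until the previous holder clears the flag
    }

    void unlock()
    {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};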

P.S: if that's not a trade secret, I would be glad to hear more about this algorithm of yours.

OTHER TIPS

Yes, it is safe. You can't have a data race without at least one thread modifying the shared variable. Since neither thread modifies p, there is no race.

The code you posted and the question are two different things.

The code will work because the object you reach through the dereference is atomic: you dereference a std::atomic<bool>*, so the reads and writes go through the atomic's overloaded operators and are sequentially consistent loads/stores. This is probably stronger (and less efficient) than necessary (most of the time such a flag only needs release/acquire semantics), but it is safe.

Otherwise, dereferencing a valid non-atomic pointer to whatever (including an atomic variable) is safe as long as no other thread modifies the data.

Dereferencing a non-atomic pointer while another thread writes to it is still "safe" insofar as it will not crash. There are, however, no formal guarantees that memory is not garbled (for aligned PODs there is a very practical guarantee due to how processors access memory, though), but more importantly it is unsafe insofar as there are no memory ordering guarantees. When using such a flag, one normally does something like this:

do_work(&buf); // writes data to buf
done = true;   // synchronize

This works as intended with one thread, but it is not guaranteed to work properly in the presence of concurrency. For that, you need a happens-before guarantee. Otherwise, it is possible that the other thread picks up the update to the flag before the write to the data has become visible.
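A hedged sketch of the usual fix, assuming done becomes a std::atomic<bool> (do_work and Buffer here are stand-ins for whatever the snippet above actually writes):

#include <atomic>
#include <thread>

struct Buffer { int value = 0; };            // stand-in for the real data
Buffer buf;
std::atomic<bool> done(false);

void do_work(Buffer* b) { b->value = 42; }   // "writes data to buf"

void writer()
{
    do_work(&buf);
    done.store(true, std::memory_order_release);   // the buf writes happen-before this store
}

void reader()
{
    while (!done.load(std::memory_order_acquire))  // pairs with the release store
        ;
    int v = buf.value;                             // guaranteed to see do_work's write (42)
    (void)v;
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}

Plain done = true and while (!done) on the atomic would also be correct, but they use the sequentially consistent default mentioned above; the explicit release/acquire pair is the weaker ordering that is still sufficient here.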

Dereferencing (that is, reading an address) is atomic on Intel architectures. Furthermore, since the pointer is constant, I guess it is going to be correct not only on Intel/AMD. However, look at this post for more information.

Clarification: on other architectures it is possible that a thread is switched out in the middle of writing an address, when only part of the address has been written, so the address read by the other thread would be invalid.

With Intel this cannot happen if the address is aligned in memory.

Furthermore, since *p is a std::atomic<bool>, it already implements all that is needed (native, asm, memory fences).
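If you want to check that the atomic really maps to plain native instructions on your target, rather than a hidden lock, std::atomic exposes is_lock_free(); a quick sketch:

#include <atomic>
#include <iostream>

int main()
{
    std::atomic<bool> flag(false);
    // Prints 1 on mainstream x86/ARM targets: loads and stores compile to
    // ordinary (fenced where needed) instructions, not a mutex-based fallback.
    std::cout << "atomic<bool> lock-free: " << flag.is_lock_free() << '\n';
}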
