Question

I've been reading about volatile (https://www.ibm.com/developerworks/java/library/j-jtp06197/) and came across a bit that says that a volatile write is so much more expensive than a nonvolatile write.

I can understand that there would be an increased cost associated with a volatile write, given that volatile is a means of synchronization, but I want to know exactly how a volatile write is so much more expensive than a nonvolatile write. Does it perhaps have to do with visibility across different thread stacks at the time the volatile write is made?

Solution

Here's why, according to the article you have indicated:

Volatile writes are considerably more expensive than nonvolatile writes because of the memory fencing required to guarantee visibility but still generally cheaper than lock acquisition.

[...] volatile reads are cheap -- nearly as cheap as nonvolatile reads

And that is, of course, true: the memory fence operations are bound to the write, while reads execute the same way regardless of whether the underlying variable is volatile or not.

However, volatile in Java is about much more than just volatile vs. nonvolatile memory read. In fact, in its essence it has nothing to do with that distinction: the difference is in the concurrent semantics.

Consider this notorious example:

volatile boolean runningFlag = true;

void run() {
  while (runningFlag) { doWork(); }
}

If runningFlag wasn't volatile, the JIT compiler could essentially rewrite that code to

void run() {
   if (runningFlag) while (true) { doWork(); }
}

The overhead of reading runningFlag on each iteration, compared against not reading it at all, is, needless to say, enormous.
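To make the example above concrete, here is a minimal runnable sketch (my own illustration, with `doWork` replaced by a simple counter increment): a worker thread spins on the volatile flag, and the main thread stops it by clearing the flag. Because the flag is volatile, the JIT cannot hoist the read out of the loop, so the worker terminates promptly.

```java
public class VolatileStopDemo {
    // volatile forces the worker to re-read the flag on every iteration
    static volatile boolean runningFlag = true;
    static long iterations = 0; // stand-in for "do work"

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (runningFlag) {
                iterations++;
            }
        });
        worker.start();
        Thread.sleep(100);     // let the worker spin for a moment
        runningFlag = false;   // volatile write: guaranteed visible to the worker
        worker.join(1000);     // worker exits promptly because the flag is volatile
        System.out.println("worker alive after join: " + worker.isAlive());
    }
}
```

If you remove the volatile keyword, the worker may never observe the write and the join can time out, which is exactly the hoisting hazard described above.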

OTHER TIPS

It is about caching. Modern processors use caches, so if you don't specify volatile, the data can stay in a cache (which sits close to the processor) and the write is fast. If the variable is marked volatile, the system has to write it all the way through to memory, and that is a slower operation.

And yes, you are thinking along the right lines: it does have to do with different threads, since each thread is separate and reads from the SAME memory, but not necessarily from the same cache. Today's processors use many levels of caching, so this can be a real problem when multiple threads/processes share the same data.

EDIT: If the data stays in a local cache, other threads/processes won't see the change until the data is written back to memory.
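The visibility guarantee this answer describes can be demonstrated with a small sketch of my own (not from the answer): a plain field is published through a volatile flag. Under the Java Memory Model, the volatile write to `ready` happens-before the volatile read that observes it, so the reader is guaranteed to see the earlier non-volatile write to `payload`.

```java
public class PublishDemo {
    static int payload;                   // plain, non-volatile field
    static volatile boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { }            // volatile read on each iteration
            // happens-before: once ready is seen true, payload is visible
            System.out.println("payload = " + payload);
        });
        reader.start();
        payload = 42;                     // ordinary write
        ready = true;                     // volatile write publishes payload
        reader.join();
    }
}
```

Without the volatile modifier on `ready`, the reader could spin forever or, in principle, observe a stale `payload`.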

Most likely it has to do with the fact that a volatile write has to stall the pipeline.

All writes are queued to be written to the caches. You don't see this with non-volatile writes/reads because the code can just take the value you just wrote, without involving the cache.

When you use a volatile read, it has to go back to the cache, which means the preceding write (as implemented) cannot continue until it has been written to the cache (in the case where you do a write followed by a read).

One way around this is to use a lazy write, e.g. AtomicInteger.lazySet(), which can be ~10x faster than a volatile write because it doesn't wait for the store to complete.
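A rough timing sketch of the set() vs. lazySet() difference, assuming a standard JVM (absolute numbers vary widely by hardware and JIT warm-up, so treat the printed times as indicative only; since Java 9 the same release-only semantics are also exposed as setRelease()):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazySetDemo {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger();
        final int N = 10_000_000;

        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) counter.set(i);      // full volatile write
        long setNs = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < N; i++) counter.lazySet(i);  // release-only write
        long lazyNs = System.nanoTime() - t1;

        System.out.println("set:     " + setNs / N + " ns/op");
        System.out.println("lazySet: " + lazyNs / N + " ns/op");
        System.out.println("final value: " + counter.get());
    }
}
```

The trade-off is that a lazySet store may become visible to other threads slightly later, which is acceptable for things like progress counters but not for flags whose timely visibility matters.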

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow