Question

I understand that the new Java (8) has introduced new synchronization tools such as LongAccumulator (in the java.util.concurrent.atomic package).

The documentation says that LongAccumulator is more efficient when the variable is updated frequently from several threads.

I wonder: how is it implemented to be more efficient?


Solution

That's a very good question, because it touches a very important characteristic of concurrent programming with shared memory. Before going into details, I have to take a step back. Take a look at the following class:

class Accumulator {
    private final AtomicLong value = new AtomicLong(0);
    public void accumulate(long value) {
        this.value.addAndGet(value);
    }
    public long get() {
        return this.value.get();
    }
}

If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two orders of magnitude slower.
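To see the slowdown yourself, a deliberately naive timing harness like the one below is enough (for trustworthy numbers you would use JMH; the class name and thread counts here are just illustrative choices, not part of the original answer):

```java
import java.util.concurrent.atomic.AtomicLong;

// Naive demonstration: hammer one shared AtomicLong from N threads and time it.
// With more threads, every addAndGet fights for the same cache line.
class ContentionDemo {
    static long run(int threads, long perThread) throws InterruptedException {
        AtomicLong shared = new AtomicLong(0);
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (long j = 0; j < perThread; j++) {
                    shared.addAndGet(1); // all threads contend on one memory location
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        long elapsed = System.nanoTime() - start;
        if (shared.get() != threads * perThread) throw new AssertionError("lost updates");
        return elapsed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 thread:  " + run(1, 1_000_000) + " ns");
        System.out.println("4 threads: " + run(4, 1_000_000) + " ns");
    }
}
```

On a typical multi-core machine the multi-threaded run is dramatically slower per update, even though the total work is the same.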

You have to take a look at the memory architecture to understand what happens. Most systems nowadays have non-uniform memory access. In particular, each core has its own L1 cache, which is typically structured into cache lines of 64 bytes. If a core executes an atomic increment operation on a memory location, it first has to get exclusive access to the corresponding cache line. That's expensive if it doesn't already have exclusive access, due to the required coordination with all other cores.
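Roughly speaking, an atomic add boils down to a compare-and-set retry loop (modern JDKs use a hardware intrinsic instead, but the cache-line contention pattern is the same). A sketch of that loop:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the retry loop behind an atomic add. Every compareAndSet
// attempt requires the core to hold the cache line in exclusive state.
class CasSketch {
    static long addAndGet(AtomicLong a, long delta) {
        while (true) {
            long current = a.get();
            long next = current + delta;
            if (a.compareAndSet(current, next)) {
                return next; // success: no other thread changed the value in between
            }
            // another core won the race; reload and retry
        }
    }
}
```

Under contention, both the failed CAS attempts and the cache-line ping-pong between cores cost time.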

There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:

class Accumulator {
    private final AtomicLong[] values = {
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
    };
    public void accumulate(long value) {
        int index = getMagicValue();
        this.values[index % values.length].addAndGet(value);
    }
    public long get() {
        long result = 0;
        for (AtomicLong value : values) {
            result += value.get();
        }
        return result;
    }
}

At first glance, this class seems to be more expensive due to the additional operations. However, it can be several times faster than the first class, because there is a higher probability that the executing core already has exclusive access to the required cache line.

To make this really fast, you have to consider a few more things:

  • The different atomic counters should be located on different cache lines. Otherwise you just replace one problem with another: false sharing. In Java you can use a long[8 * 4] for that purpose and only use indexes 0, 8, 16, and 24.
  • The number of counters has to be chosen wisely. If there are too few counters, there are still too many cache-line transfers. If there are too many counters, you waste space in the L1 caches.
  • The method getMagicValue should return a value with an affinity to the core id.
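The three points above can be combined into a sketch like the following. Note that this is my illustration, not the JDK's actual code (which lives in java.util.concurrent.atomic.Striped64); the padding scheme and the thread-id hash standing in for getMagicValue are assumptions:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Striped counter sketch: 4 counters, each spaced 8 longs (64 bytes) apart
// so that each one occupies its own cache line within the array.
class StripedAccumulator {
    private static final int STRIDE = 8;  // 8 longs = 64 bytes = one cache line
    private static final int STRIPES = 4;
    // Only indexes 0, 8, 16, 24 are used; the rest is padding.
    private final AtomicLongArray values = new AtomicLongArray(STRIDE * STRIPES);

    // Cheap stand-in for a core-affine value: a hash of the thread id.
    // The real Striped64 uses a per-thread probe that is rehashed on contention.
    private int getMagicValue() {
        return (int) (Thread.currentThread().getId() % STRIPES);
    }

    public void accumulate(long value) {
        values.addAndGet(getMagicValue() * STRIDE, value);
    }

    public long get() {
        long result = 0;
        for (int i = 0; i < STRIPES; i++) {
            result += values.get(i * STRIDE); // sum all partial counters
        }
        return result;
    }
}
```

The JVM gives no hard guarantee about the alignment of the array's first element, so the padding is a strong heuristic rather than a certainty; the JDK's own @Contended annotation exists for exactly this purpose.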

To sum up, LongAccumulator is more efficient for some use cases, because it uses redundant memory for frequent write operations in order to reduce the number of times that cache lines have to be exchanged between cores. On the other hand, read operations are slightly more expensive, because they have to combine the partial counters into a consistent result.
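For completeness, this is how the real classes are used. LongAdder is the specialization for sums; LongAccumulator takes an arbitrary accumulation function and an identity value (the thread counts and loop bounds below are just example values):

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

public class AccumulatorUsage {
    public static void main(String[] args) throws InterruptedException {
        LongAdder sum = new LongAdder();  // specialized for addition
        LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);

        Runnable work = () -> {
            for (int i = 1; i <= 1000; i++) {
                sum.add(i);          // cheap under contention
                max.accumulate(i);   // applies Long::max to the running value
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(sum.sum());  // 2 * 500500 = 1001000
        System.out.println(max.get());  // 1000
    }
}
```

Note that sum() and get() are not atomic snapshots: values added concurrently during the read may or may not be included, which is exactly the read-side cost mentioned above.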

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow