You seem to have false sharing on the AtomicLong between the threads/cores. I'll try it out with a demo when I have more time later. It would be much better, however, to give each WorkHandler a private variable that only its own thread touches (either its own AtomicLong or, preferably, a plain long).
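As a rough illustration of the per-handler counter idea, here is a minimal sketch in plain Java, without the Disruptor on the classpath. `PerThreadCounters`, `CountingWorker` and `runWorkers` are hypothetical names made up for this example; in the real code the counter would be a field of each WorkHandler implementation instead.

```java
import java.util.ArrayList;
import java.util.List;

public class PerThreadCounters {

    /** Hypothetical stand-in for a WorkHandler: counts in a field only its own thread writes. */
    static final class CountingWorker implements Runnable {
        private final int iterations;
        private long count; // plain long: no atomic operations while the worker runs

        CountingWorker(int iterations) {
            this.iterations = iterations;
        }

        @Override
        public void run() {
            for (int i = 0; i < iterations; i++) {
                count++; // stands in for handling one event
            }
        }

        long getCount() {
            return count;
        }
    }

    /** Runs the workers to completion, then sums their private counters. */
    static long runWorkers(int workers, int iterations) {
        List<CountingWorker> handlers = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            CountingWorker handler = new CountingWorker(iterations);
            handlers.add(handler);
            Thread thread = new Thread(handler);
            threads.add(thread);
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join(); // join() makes each worker's final count visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long total = 0;
        for (CountingWorker handler : handlers) {
            total += handler.getCount();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("amount of work done " + runWorkers(4, 100_000));
    }
}
```

The counters are only read after every thread has been joined, so no atomics are needed for the final sum. Note that small objects allocated back to back can still end up on the same cache line, so this removes contention on one shared AtomicLong but does not by itself guarantee the absence of false sharing.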
Update:
If you change your Disruptor line to:
    Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(
            ValueEvent.EVENT_FACTORY, ringBufferSize,
            exec,
            com.lmax.disruptor.dsl.ProducerType.SINGLE,
            new com.lmax.disruptor.BusySpinWaitStrategy());
you'll get much better results:
jason@debian01:~/code/stackoverflow$ java -cp disruptor-3.1.1.jar:. TwoPhaseDisruptor 4 100000 1024
time spent 2728 ms
amount of work done 200000
I reviewed the code and tried to fix the false sharing, but found little improvement. That's when I noticed that on my 8-core machine the CPUs were nowhere near 100% utilization (even for the four-worker test). From this I concluded, at least, that a yielding or busy-spinning wait strategy will reduce latency if you have CPU to burn.
Just make sure you have enough cores: you'll need 8 for processing, plus one for publishing the messages.
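The core budget can be checked at startup before committing to a busy-spinning strategy. The following is a small sketch under the thread counts mentioned above (8 handler threads, 1 publisher); `CoreBudget`, `threadsNeeded` and `canBusySpin` are hypothetical helper names, not part of the Disruptor API.

```java
public class CoreBudget {

    /** Total threads that would be spinning or publishing at the same time. */
    static int threadsNeeded(int handlerThreads, int publisherThreads) {
        return handlerThreads + publisherThreads;
    }

    /** True when every busy thread can have a core to itself. */
    static boolean canBusySpin(int availableCores, int handlerThreads, int publisherThreads) {
        return availableCores >= threadsNeeded(handlerThreads, publisherThreads);
    }

    public static void main(String[] args) {
        // availableProcessors() reports logical processors, which may include hyper-threads.
        int cores = Runtime.getRuntime().availableProcessors();
        boolean spin = canBusySpin(cores, 8, 1); // assumed: 8 handlers, 1 publisher
        System.out.println(cores + " logical cores, busy-spin advisable: " + spin);
    }
}
```

If the check fails, a yielding or blocking wait strategy is probably the safer choice, since spinning threads that outnumber the cores will compete with each other for CPU time.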