disruptor performance issues when using two layers of multiple handlers in a pool

https://stackoverflow.com/questions/17920976

disruptor-pattern

04-06-2022

Question

I'm trying to use the Disruptor to process messages. I need two phases of processing, i.e. two groups of handlers, each working as a worker pool, like this (I guess):

disruptor.
handleEventsWithWorkerPool(
    firstPhaseHandlers)
.thenHandleEventsWithWorkerPool(
    secondPhaseHandlers);

When using the code above, if I put more than one worker in each group, performance deteriorates: far more CPU is burned for the exact same amount of work.

I tried tweaking the ring buffer size (which I had already seen has an impact on performance), but in this case it didn't help. So am I doing something wrong, or is this a real problem?

I'm attaching a full demo of the issue.

import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.WorkHandler;
import com.lmax.disruptor.dsl.Disruptor;

// Mutable event; instances are pre-allocated in the ring buffer and reused.
final class ValueEvent {
    private long value;

    public long getValue() {
        return value;
    }

    public void setValue(long value) {
        this.value = value;
    }

    // Factory the Disruptor calls to fill the ring buffer up front.
    public final static EventFactory<ValueEvent> EVENT_FACTORY = new EventFactory<ValueEvent>() {
        public ValueEvent newInstance() {
            return new ValueEvent();
        }
    };
}

// First-phase handler: counts every event it receives on the shared counter.
class MyWorkHandler implements WorkHandler<ValueEvent> {

    AtomicLong workDone;

    public MyWorkHandler(AtomicLong wd) {
        this.workDone = wd;
    }

    public void onEvent(final ValueEvent event) throws Exception {
        workDone.incrementAndGet();
    }
}

// Second-phase handler: same work as the first phase, on the same shared counter.
class My2ndPahseWorkHandler implements WorkHandler<ValueEvent> {

    AtomicLong workDone;

    public My2ndPahseWorkHandler(AtomicLong wd) {
        this.workDone = wd;
    }

    public void onEvent(final ValueEvent event) throws Exception {
        workDone.incrementAndGet();
    }
}

// Copies the published value into the pre-allocated event.
class MyEventTranslator implements EventTranslatorOneArg<ValueEvent, Long> {

    @Override
    public void translateTo(ValueEvent event, long sequence, Long value) {
        event.setValue(value);
    }
}

public class TwoPhaseDisruptor {

    // Single counter shared by every handler in both phases.
    static AtomicLong workDone = new AtomicLong(0);

    // args: <handlers per group> <event count> <ring buffer shift>
    @SuppressWarnings("unchecked")
    public static void main(String[] args) {

        ExecutorService exec = Executors.newCachedThreadPool();

        int numOfHandlersInEachGroup = Integer.parseInt(args[0]);
        long eventCount = Long.parseLong(args[1]);
        // Note: 2 << n is 2^(n+1), so an argument of 7 gives a 256-slot ring.
        int ringBufferSize = 2 << (Integer.parseInt(args[2]));

        Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(
                ValueEvent.EVENT_FACTORY, ringBufferSize,
                exec);

        ArrayList<MyWorkHandler> handlers = new ArrayList<MyWorkHandler>();
        for (int i = 0; i < numOfHandlersInEachGroup; i++) {
            handlers.add(new MyWorkHandler(workDone));
        }

        ArrayList<My2ndPahseWorkHandler> phase2_handlers = new ArrayList<My2ndPahseWorkHandler>();
        for (int i = 0; i < numOfHandlersInEachGroup; i++) {
            phase2_handlers.add(new My2ndPahseWorkHandler(workDone));
        }

        // Phase 1 pool, then a phase 2 pool that only sees events phase 1 has finished.
        disruptor
                .handleEventsWithWorkerPool(
                        handlers.toArray(new WorkHandler[handlers.size()]))
                .thenHandleEventsWithWorkerPool(
                        phase2_handlers.toArray(new WorkHandler[phase2_handlers.size()]));

        // The timer covers start-up, publishing and draining the ring.
        long s = (System.currentTimeMillis());
        disruptor.start();

        MyEventTranslator myEventTranslator = new MyEventTranslator();
        for (long i = 0; i < eventCount; i++) {
            disruptor.publishEvent(myEventTranslator, i);
        }

        disruptor.shutdown();
        exec.shutdown();
        System.out.println("time spent " + (System.currentTimeMillis() - s) + "     ms");
        System.out.println("amount of work done " + workDone.get());
    }
}

Try running the above example with 1 thread in each group:

1 100000 7

On my computer it gave:

time spent 371 ms
amount of work done 200000

Then try it with 4 threads in each group

4 100000 7

which on my computer gave

time spent 9853 ms
amount of work done 200000

During the run the CPU is at 100% utilization.

Solution

You seem to be sharing the AtomicLong between the threads/cores, so every increment contends for the same cache line (the same kind of cost as false sharing). I'll try it out with a demo later when I have more time. However, it would be much better for each WorkHandler to have a private variable that its thread owns (either its own AtomicLong or, preferably, a plain long).
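To illustrate the "private variable per worker" idea, here is a minimal sketch that uses plain threads rather than the Disruptor. `CountingWorker` and `PrivateCounterDemo` are made-up names: the worker is only a stand-in for a WorkHandler, and the totals are read after the threads have been joined, which is what makes reading the plain `long` safe in this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a WorkHandler: each worker owns its own counter.
final class CountingWorker implements Runnable {
    private final long eventsToHandle;
    private long handled; // plain long, written only by this worker's thread

    CountingWorker(long eventsToHandle) {
        this.eventsToHandle = eventsToHandle;
    }

    @Override
    public void run() {
        for (long i = 0; i < eventsToHandle; i++) {
            handled++; // no atomic read-modify-write on a shared variable
        }
    }

    long handled() {
        return handled;
    }
}

final class PrivateCounterDemo {
    // Runs the workers to completion and sums their private counters.
    static long runWorkers(int workers, long eventsPerWorker) {
        List<CountingWorker> pool = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            CountingWorker worker = new CountingWorker(eventsPerWorker);
            pool.add(worker);
            Thread thread = new Thread(worker);
            threads.add(thread);
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join(); // join() makes each worker's writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long total = 0;
        for (CountingWorker worker : pool) {
            total += worker.handled();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("amount of work done " + runWorkers(4, 50_000));
    }
}
```

In the question's demo the equivalent change would be to give each handler its own counter field and add the counters up once the Disruptor has shut down, instead of passing the same AtomicLong to every handler.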

Update:

If you change your Disruptor line to:

Disruptor<ValueEvent> disruptor = new Disruptor<ValueEvent>(
        ValueEvent.EVENT_FACTORY, ringBufferSize,
        exec,
        com.lmax.disruptor.dsl.ProducerType.SINGLE,
        new com.lmax.disruptor.BusySpinWaitStrategy());

You'll get much better results:

jason@debian01:~/code/stackoverflow$ java -cp disruptor-3.1.1.jar:. TwoPhaseDisruptor 4 100000 1024
time spent 2728     ms
amount of work done 200000
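One caveat about the third argument in the command above. The question's code computes the ring size as `2 << Integer.parseInt(args[2])`, and Java uses only the low five bits of an `int` shift count. So, assuming the demo was run unmodified, an argument of 1024 gives a ring of 2 slots, not 1024. The sketch below isolates that expression (`RingSizeDemo` is a made-up name):

```java
final class RingSizeDemo {
    // Same expression the demo uses to turn args[2] into a ring buffer size.
    static int ringBufferSize(int shiftArg) {
        return 2 << shiftArg;
    }

    public static void main(String[] args) {
        System.out.println(ringBufferSize(7));    // 256, the size used in the question's runs
        System.out.println(ringBufferSize(9));    // 1024
        System.out.println(ringBufferSize(1024)); // 2, because 1024 & 31 == 0
    }
}
```

To get a 1024-slot ring from the unmodified demo, the third argument would have to be 9.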

I reviewed the code and tried to fix false sharing, but found little improvement. That's when I noticed on my 8-core machine that the CPUs were nowhere near 100% (even for the four-worker test). From this I concluded, at least, that a yielding/spinning wait strategy will reduce latency if you have CPU to burn.
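As a rough picture of what "spinning" and "yielding" mean here, the sketch below shows a consumer waiting for a published sequence in both styles. It is not the Disruptor's implementation, only an illustration with made-up names (`SpinWaitSketch`, `busySpinUntil`, `yieldUntil`): the spinning loop never gives up its core, while the yielding loop offers it to other runnable threads on each pass.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SpinWaitSketch {
    // Busy-spin: re-read the cursor in a tight loop until it reaches `wanted`.
    static long busySpinUntil(AtomicLong cursor, long wanted) {
        long seen;
        while ((seen = cursor.get()) < wanted) {
            // intentionally empty: burns a core, reacts as soon as the cursor moves
        }
        return seen;
    }

    // Yielding: same check, but hint the scheduler between reads.
    static long yieldUntil(AtomicLong cursor, long wanted) {
        long seen;
        while ((seen = cursor.get()) < wanted) {
            Thread.yield();
        }
        return seen;
    }

    // Publishes sequences 0..events-1 while a consumer spins for the last one.
    static long demo(long events) {
        AtomicLong cursor = new AtomicLong(-1);
        long[] result = new long[1];
        Thread consumer = new Thread(() -> result[0] = busySpinUntil(cursor, events - 1));
        consumer.start();
        for (long i = 0; i < events; i++) {
            cursor.set(i); // "publish" sequence i
        }
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for consumer", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("last sequence seen " + demo(1_000));
    }
}
```

Either loop only helps when a core is free for each waiting thread; if waiting threads outnumber cores, they mostly take time away from the threads doing real work.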

Just make sure you have enough cores: the four-worker test runs 8 handler threads, plus one thread publishing the messages.
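A quick way to check this before choosing a spinning wait strategy is to compare the thread count with what the JVM reports. This is only a sketch with made-up names (`CoreBudget`, `threadsNeeded`, `enoughCores`), and it assumes one publishing thread, as in the demo:

```java
final class CoreBudget {
    // Threads that would be spinning or working at the same time.
    static int threadsNeeded(int handlersPerGroup, int groups, int publishers) {
        return handlersPerGroup * groups + publishers;
    }

    static boolean enoughCores(int needed, int available) {
        return available >= needed;
    }

    public static void main(String[] args) {
        int needed = threadsNeeded(4, 2, 1); // 4 workers x 2 phases + 1 publisher = 9
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("threads needed: " + needed + ", cores available: " + cores);
        if (!enoughCores(needed, cores)) {
            System.out.println("fewer cores than busy threads: spinning is likely to hurt");
        }
    }
}
```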

Licensed under: CC-BY-SA with attribution
