Busy waiting is good if you need to wake up thread #2 as fast as possible. In fact this is the fastest way to notify one processor about changes made by another processor. You need to generate memory fences on both ends (write fence on one side and read fence on the other). But this statement holds true only if your both threads are executed on dedicated processors. In this case no context switching is needed, just cache coherency traffic.
There is some improvements can be made.
- If thread #2 is in general CPU bound and do busy waiting - it can be penalized by the scheduler (at least on windows and linux). OS scheduler dynamically adjust thread priorities to improve overall system performance. It reduces priorities of CPU bound threads that consumes large amount of CPU time to prevent thread starvation. You need to manually increase priority of thread #2 to prevent this.
- If you have multicore or multiprocessor machine, you will end up with undersubscription of processors and your application won't be able to exploit hardware concurrency. You can mitigate this by using several processor threads (thread #2).
Parallelization of processing step. There is two options.
- Your messages is totally ordered and need to be processed in the same order as they arrived.
- Messages can be reordered. Processing can be done in any order.
You need N cycle buffers and N processing threads and N output buffers and one consumer thread in first case. Thread #1 enqueues messages in round-robin order in that cycle buffers.
// Thread #1 pseudocode
auto message = recv()
auto buffer_index = atomic_increment(&message_counter);
buffer_index = buffer_index % N; // N is the number of threads
// buffers is an array of cyclic buffers - Buffer* buffers[N];
Buffer* current_buffer = buffers[buffer_index];
current_buffer->euqueue(message);
Each thread consumes messages from one of the buffers and enqueues result to his dedicated output buffer.
// Thread #i pseudocode
auto message = my_buffer->dequeue();
auto result = process(message);
my_output_buffer->enqueue(result);
Now you need to process all this messages in the arrival order. You can do this with dedicated consumer thread by dequeuing processed messages from output cyclic buffers in round-robin manner.
// Consumer thread pseudocode
// out_message_counter is equal to message_counter at start
auto out_buffer_index = atomic_increment(&out_message_counter);
out_buffer_index = out_buffer_index % N;
// out_buffers is array of output buffers that is used by processing
// threads
auto out_buffer = out_buffers[out_buffer_index];
auto result = out_buffer->dequeue();
send(result); // or whatever you need to do with result
In second case, when you doesn't need to preserve message order - you doesn't need the consumer thread and output cyclic buffers. You just do whatever you need to do with result in processing thread.
N must be equal num CPU's
- 3 in first case ("- 3" is one I/O thread + one consumer thread + one DHT thread) and num CPU's
- 2 in second case ("- 2" is one I/O thread + one DHT thread). This is because busy wait can't be effective if you have oversubscription of processors.