High throughput non-blocking server design: Alternatives to busy wait

Question 1

Busy waiting is good if you need to wake up thread #2 as fast as possible. In fact this is the fastest way to notify one processor about changes made by another processor. You need to generate memory fences on both ends (write fence on one side and read fence on the other). But this statement holds true only if your both threads are executed on dedicated processors. In this case no context switching is needed, just cache coherency traffic.

There is some improvements can be made.

If thread #2 is in general CPU bound and do busy waiting - it can be penalized by the scheduler (at least on windows and linux). OS scheduler dynamically adjust thread priorities to improve overall system performance. It reduces priorities of CPU bound threads that consumes large amount of CPU time to prevent thread starvation. You need to manually increase priority of thread #2 to prevent this.
If you have multicore or multiprocessor machine, you will end up with undersubscription of processors and your application won't be able to exploit hardware concurrency. You can mitigate this by using several processor threads (thread #2).

Parallelization of processing step. There is two options.

Your messages is totally ordered and need to be processed in the same order as they arrived.
Messages can be reordered. Processing can be done in any order.

You need N cycle buffers and N processing threads and N output buffers and one consumer thread in first case. Thread #1 enqueues messages in round-robin order in that cycle buffers.

// Thread #1 pseudocode
auto message = recv()
auto buffer_index = atomic_increment(&message_counter);
buffer_index = buffer_index % N;  // N is the number of threads
// buffers is an array of cyclic buffers - Buffer* buffers[N];
Buffer* current_buffer = buffers[buffer_index];
current_buffer->euqueue(message);

Each thread consumes messages from one of the buffers and enqueues result to his dedicated output buffer.

// Thread #i pseudocode
auto message = my_buffer->dequeue();
auto result = process(message);
my_output_buffer->enqueue(result);

Now you need to process all this messages in the arrival order. You can do this with dedicated consumer thread by dequeuing processed messages from output cyclic buffers in round-robin manner.

// Consumer thread pseudocode
// out_message_counter is equal to message_counter at start
auto out_buffer_index = atomic_increment(&out_message_counter);
out_buffer_index = out_buffer_index % N;
// out_buffers is array of output buffers that is used by processing
// threads
auto out_buffer = out_buffers[out_buffer_index];
auto result = out_buffer->dequeue();
send(result);  // or whatever you need to do with result

In second case, when you doesn't need to preserve message order - you doesn't need the consumer thread and output cyclic buffers. You just do whatever you need to do with result in processing thread.

N must be equal num CPU's - 3 in first case ("- 3" is one I/O thread + one consumer thread + one DHT thread) and num CPU's - 2 in second case ("- 2" is one I/O thread + one DHT thread). This is because busy wait can't be effective if you have oversubscription of processors.

Question 2

Sounds like you want to coordinate a producer and consumer connected by some shared state. At least in Java for such patterns, one way to avoid busy wait is to use wait and notify. With this approach, thread #2 can go into a blocked state if it finds that the queue is empty by calling wait and avoid spinning the CPU. Once thread #1 puts some stuff in the queue, it can do a notify. A quick search of such mechanisms in C++ yields this:

wait and notify in C/C++ shared memory

Question 3

You can have thread #2 go to sleep for X miliseconds when the queue is empty.

X can be determined by the length of the queues you want + some guard band.

BTW, in user mode (ring3) you can't use MONITOR/MWAIT instructions which would be ideal for your question.

Edit

You should definitely give TBB's RWlock a try (there's a free version). Sounds like what you're looking for.

Edit2

Another option is to use conditional variables. They involve a mutex and a condition. Basically you wait on the condition to become "true". The low level pthread stuff can be found here.