libevent / epoll number of worker threads?

Question 1

Your question is even better than you think! :-P

If you do networking with libevent, it can do non-blocking I/O on sockets. One thread could do this (using one core), but on a multi-core machine that would leave the other cores idle and under-utilize the CPU.

But if you do “heavy” file I/O, then there is no portable non-blocking interface to the kernel. (Many systems have nothing of the kind at all; others have some half-baked, non-portable support, which libevent doesn't use.) If file I/O is the bottleneck in your program/test, then a higher number of threads will make sense! If a hard disk is used and the I/O scheduler is reordering requests to avoid disk-head movement, the best thread count depends on how many requests the scheduler can take into account to do its job well. 100 pending requests might work better than 8.
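A minimal sketch of the "more threads for blocking file I/O" idea, using Python's standard `concurrent.futures` rather than libevent. The names `read_file`, `read_all` and `demo`, and the thread count of 16, are made up for illustration. Each worker thread sits in a blocking read, so many requests can be pending at the kernel at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Blocking read: the calling thread waits until the kernel returns the data."""
    with open(path, "rb") as f:
        return f.read()


def read_all(paths, num_threads):
    """Keep up to num_threads blocking reads in flight; results follow the order of paths."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(read_file, paths))


def demo():
    # Create a few small temporary files so the sketch is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(20):
            path = os.path.join(tmp, f"chunk{i}.bin")
            with open(path, "wb") as f:
                f.write(bytes([i]) * 1024)
            paths.append(path)
        # Deliberately more threads than a typical core count (hypothetical value).
        return read_all(paths, num_threads=16)


if __name__ == "__main__":
    print(sum(len(data) for data in demo()), "bytes read")
```

Whether 16, 100 or some other number of threads is best is something you would have to measure on your own disk and scheduler.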

Why shouldn't you increase the thread number?

If non-blocking I/O is done: all cores are already kept busy with thread-count = core-count. More threads only mean more thread-switching with no gain.

For blocking I/O: you should increase it!
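As a rough sketch, the rule of thumb could be written down like this. The function name and the `blocking_factor` parameter are my own inventions; the factor is a tuning knob you would have to find by measurement, not a recommended value.

```python
import os


def suggested_threads(blocking_io, cores=None, blocking_factor=8):
    """Rule-of-thumb worker count.

    blocking_factor is a hypothetical knob: how many threads per core
    we allow when threads spend most of their time blocked in I/O.
    """
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() may return None
    if blocking_io:
        return cores * blocking_factor
    return cores  # non-blocking I/O: one thread per core
```

For example, `suggested_threads(False, cores=8)` gives 8, while `suggested_threads(True, cores=8, blocking_factor=4)` gives 32.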

Question 2

Context Switching

For an OS to context switch between threads takes a little bit of time. Having a lot of threads, each one doing comparatively little work, means that the context switch time starts becoming a significant portion of the overall runtime of the application.

For example, it could take an OS about 10 microseconds to do a context switch; if the thread does only 15 microseconds worth of work before going back to sleep then 40% of the runtime is just context switching!
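The arithmetic behind that figure, as a small sketch. It assumes one context switch per burst of work; the function name is made up.

```python
def context_switch_share(switch_us, work_us):
    """Fraction of elapsed time spent switching, given one switch per burst of work."""
    return switch_us / (switch_us + work_us)


# 10 us of switching for every 15 us of useful work.
share = context_switch_share(10, 15)
```

With 10 and 15 microseconds the share is 10 / 25, i.e. 40%. If the thread instead did 990 microseconds of work per wake-up, the same 10 microsecond switch would cost only 1%.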

This is inefficient, and that sort of inefficiency really starts to show up when you're scaling up, as your hardware, power and cooling costs go through the roof. Having few threads means that the OS doesn't have to switch contexts anything like as much.

So in your case if your requirement is for the computer to handle 10,000 connections and you have 8 cores then the efficiency sweet spot will be 1250 connections per core.
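One simple way to arrive at that split is to hand new connections to the worker threads round-robin. This is only a sketch with made-up names; a real server might balance by load instead.

```python
def assign_round_robin(num_connections, num_workers):
    """Return one list of connection ids per worker, filled round-robin."""
    workers = [[] for _ in range(num_workers)]
    for conn_id in range(num_connections):
        workers[conn_id % num_workers].append(conn_id)
    return workers


loads = [len(w) for w in assign_round_robin(10_000, 8)]
```

With 10,000 connections and 8 workers, every worker ends up with 1250 connections.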

More Clients Per Thread

In the case of a server handling client requests it comes down to how much work is involved in processing each client. If that is a small amount of work, then each thread needs to handle requests from a number of clients so that the application can handle a lot of clients without having a lot of threads.

In a network server this means getting familiar with the select() or epoll() (strictly, epoll_wait()) system call. When called, these will both put the thread to sleep until one of the mentioned file descriptors becomes ready in some way. However, if there are no other threads pestering the OS for runtime, the OS won't necessarily need to perform a context switch; the thread can just sit there dozing until there's something to do (at least that's my understanding of what OSes do. Everyone, correct me if I'm wrong!). When some data turns up from a client it can resume a lot faster.
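Here is a minimal sketch of one thread serving several clients, using Python's standard `selectors` module (which picks epoll on Linux) instead of calling epoll or libevent directly. The function names and the upper-casing "protocol" are made up, and `socket.socketpair()` stands in for real network clients so the example runs on its own.

```python
import selectors
import socket


def serve_until_closed(server_ends):
    """Echo upper-cased data on every socket, using one thread and one selector."""
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS, ...
    for sock in server_ends:
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ)
    open_count = len(server_ends)
    while open_count:
        # The thread sleeps here until at least one socket is readable.
        for key, _events in sel.select():
            sock = key.fileobj
            data = sock.recv(4096)
            if data:
                # Fine for tiny replies; a real server must also handle a full send buffer.
                sock.sendall(data.upper())
            else:  # empty read: the peer has finished sending
                sel.unregister(sock)
                sock.close()
                open_count -= 1
    sel.close()


def demo(num_clients=3):
    pairs = [socket.socketpair() for _ in range(num_clients)]
    for i, (client, _server) in enumerate(pairs):
        client.sendall(f"hello {i}".encode())
        client.shutdown(socket.SHUT_WR)  # tell the server this client is done sending
    serve_until_closed([server for _client, server in pairs])
    replies = []
    for client, _server in pairs:
        replies.append(client.recv(4096))
        client.close()
    return replies


if __name__ == "__main__":
    print(demo())
```

The point is the shape of the loop: register every descriptor once, sleep in a single call, and only touch the sockets that are actually ready.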

And this of course makes the thread's source code a lot more complicated. You can't do a blocking read of data from the clients, for instance; being told by epoll() that a file descriptor has become ready for reading does not mean that all the data you're expecting to receive from the client can be read immediately. And if the thread stalls due to a bug, that affects more than one client. But that's the price paid for attaining the highest possible efficiency.
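To illustrate the "not all the data is there yet" problem, here is a sketch of a per-connection buffer that collects whatever each read returns and only hands back complete messages. The 4-byte length prefix is an assumed wire format for the example, and the class name is made up.

```python
import struct


class FrameBuffer:
    """Accumulates bytes for one connection and returns complete messages.

    Assumed framing (for this sketch only): 4-byte big-endian length, then the body.
    """

    HEADER = struct.Struct("!I")

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Add whatever recv() returned; return the list of messages now complete."""
        self._buf += chunk
        messages = []
        while len(self._buf) >= self.HEADER.size:
            (length,) = self.HEADER.unpack_from(self._buf)
            end = self.HEADER.size + length
            if len(self._buf) < end:
                break  # body not fully here yet; wait for the next readiness event
            messages.append(bytes(self._buf[self.HEADER.size:end]))
            del self._buf[:end]
        return messages
```

The event loop would keep one such buffer per file descriptor and call `feed()` each time that descriptor becomes readable, instead of blocking until the whole message has arrived.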

And it's not necessarily the case that you would want just 8 threads to go with your 8 cores and 10,000 connections. If there's something that your thread has to do for each connection every time it handles a single connection then that's an overhead that would need to be minimised (by having more threads and fewer connections per thread). [The select() system call is like that, which is why epoll() got invented.] You have to balance that overhead up against the overhead of context switching.

10,000 file descriptors is a lot for a single process in Linux; it is more than the default per-process limit on many systems (often 1024), so you might have to raise that limit (ulimit -n or setrlimit()) or have several processes instead of several threads. And then there's the small matter of whether the hardware is fundamentally able to support 10,000 connections within whatever response time / connection requirements your system has. If it isn't, then you're forced to distribute your application across two or more servers, and that can start getting really complicated!
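Whether 10,000 descriptors fit in one process depends on the per-process limit, which you can check before committing to a design. A small sketch using Python's Unix-only `resource` module; the function name `fd_headroom` is made up.

```python
import resource  # Unix-only standard library module


def fd_headroom(needed):
    """Return (soft, hard, ok); ok says whether `needed` descriptors fit the soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    ok = soft == resource.RLIM_INFINITY or soft >= needed
    return soft, hard, ok


if __name__ == "__main__":
    soft, hard, ok = fd_headroom(10_000)
    print("soft limit:", soft, "hard limit:", hard, "room for 10,000:", ok)
```

If the soft limit is too low, an unprivileged process can raise it up to the hard limit with `resource.setrlimit()`; going beyond the hard limit needs administrator configuration.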

Understanding exactly how many clients to handle per thread depends on what the processing is doing, whether there's hard disk activity involved, etc. So there's no one single answer; it's different for different applications, and also for the same application on different machines. Tuning the clients per thread to achieve peak efficiency is a really hard job. This is where profiling tools like dtrace on Solaris or ftrace on Linux (especially when used like this, which I've used a lot on Linux on x86 hardware) can help, because they allow you to understand at a very fine scale precisely what runtime is involved in your thread handling a request from a client.

Outfits like Google are of course very keen on efficiency; they get through a lot of electricity. I gather that when Google choose a CPU, hard drive, memory, etc. to put into their famously home grown servers they measure performance in terms of "Searches per Watt". Obviously you have to be a pretty big outfit before you get that fastidious about things, but that's the way things go ultimately.

Other Efficiencies

Handling things like TCP network connections can take up a lot of CPU time in its own right, and it can be difficult to understand whereabouts in a system all your CPU runtime has gone. For network connections, things like TCP offload in the smarter NICs can have a real benefit, because that frees the CPU from the burden of doing things like the checksum calculations.

TCP offload mirrors what we do in the high speed large scale real time embedded signal processing world. The (weird) interconnects that we use require zero CPU time to run them. So all of the CPU time is dedicated to processing data, and specialised hardware looks after moving data around. That brings about some quite astonishing efficiencies, so one can build a system with more modest, lower cost, less power hungry CPUs.

Language can have a radical effect on efficiency too. Things like Ruby, PHP and Perl are all very well and good, but outfits that started out with them and then grew rapidly have often ended up moving to something more efficient like Java/Scala, C++, etc.