Question

I've developed a mini HTTP server in C++ using boost::asio, and now I'm load testing it with multiple clients, but I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance, and getting about 50% usage of one CPU, 20% of another, and the remaining two are idle (according to htop).

Details:

  • The server fires up one thread per core
  • Requests are received, parsed, processed, and responses are written out
  • The requests are for data, which is read out of memory (read-only for this test)
  • I'm 'loading' the server from two machines, each running a Java application with 25 threads sending requests
  • I'm seeing about 230 requests/sec throughput (this is application requests, which are composed of many HTTP requests)

So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.

Ideas I've had:

  • The requests are very small, and often fulfilled in a few ms, I could modify the client to send/compose bigger requests (perhaps using batching)
  • I could modify the HTTP server to use the select design pattern; is this appropriate here?
  • I could do some profiling to try to understand what the bottlenecks are

Solution

boost::asio is not as thread-friendly as you would hope: there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp, which means that only one thread at a time can call into the kernel's epoll syscall. And for very small requests this makes all the difference, meaning you will only see roughly single-threaded performance.

Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
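To make that concrete, here is a minimal sketch (not boost::asio code, and with the accept path and error handling omitted) of the edge-triggered approach being described: several threads block in epoll_wait on one shared epoll instance, and EPOLLONESHOT ensures each ready socket is delivered to exactly one thread, so no user-space reactor lock is needed.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <thread>
#include <vector>

// All worker threads wait on the same epoll fd; the kernel wakes only one
// of them per ready socket because each fd is registered with EPOLLONESHOT.
void worker(int epfd) {
    epoll_event ev;
    for (;;) {
        int n = epoll_wait(epfd, &ev, 1, -1);  // blocks in the kernel; no user-space lock
        if (n <= 0) continue;
        int fd = ev.data.fd;
        // ... read from fd until EAGAIN, process, write the response ...
        // Re-arm the fd: EPOLLONESHOT disarmed it when the event fired.
        ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    }
}

int main() {
    int epfd = epoll_create1(0);
    // After accept(), each client socket would be registered like:
    //   epoll_event ev{};
    //   ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
    //   ev.data.fd = client_fd;
    //   epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        pool.emplace_back(worker, epfd);
    for (auto& t : pool) t.join();
    close(epfd);
}
```

Getting this right without races (especially around re-arming and partial reads) is the tricky part alluded to above.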

BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.

OTHER TIPS

As you are using EC2, all bets are off.

Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.

I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.

From your comments on network utilization, you do not seem to have much network traffic.

3 + 2.5 MiB/sec is in the 50 Mbps ballpark (compared to your 1 Gbps port).

I'd say you are having one of the following two problems:

  1. Insufficient work-load (a low request-rate from your clients)
  2. Blocking in the server (interfered response generation)

Looking at cmeerw's notes and your CPU utilization figures (idling at 50% + 20% + 0% + 0%), it seems most likely a limitation in your server implementation.
I second cmeerw's answer (+1).

230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation: get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of unneeded locking may get things up to speed.
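As a baseline for "properly tuned in a single thread", something like the following boost::asio skeleton is the shape to aim for: one io_service, one thread, purely asynchronous handlers, no locks. This is a sketch, not the poster's server; the echo "processing" and the port number are placeholders.

```cpp
#include <boost/asio.hpp>
#include <memory>

using boost::asio::ip::tcp;

// One connection: read a chunk, "process" it (here: echo), write it back.
class session : public std::enable_shared_from_this<session> {
public:
    explicit session(tcp::socket socket) : socket_(std::move(socket)) {}
    void start() { do_read(); }
private:
    void do_read() {
        auto self = shared_from_this();
        socket_.async_read_some(boost::asio::buffer(data_),
            [this, self](boost::system::error_code ec, std::size_t n) {
                if (!ec) do_write(n);  // real parsing/processing goes here
            });
    }
    void do_write(std::size_t n) {
        auto self = shared_from_this();
        boost::asio::async_write(socket_, boost::asio::buffer(data_, n),
            [this, self](boost::system::error_code ec, std::size_t) {
                if (!ec) do_read();
            });
    }
    tcp::socket socket_;
    char data_[4096];
};

void do_accept(tcp::acceptor& acceptor, boost::asio::io_service& io) {
    auto sock = std::make_shared<tcp::socket>(io);
    acceptor.async_accept(*sock,
        [&acceptor, &io, sock](boost::system::error_code ec) {
            if (!ec) std::make_shared<session>(std::move(*sock))->start();
            do_accept(acceptor, io);  // keep accepting
        });
}

int main() {
    boost::asio::io_service io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 8080));
    do_accept(acceptor, io);
    io.run();  // a single thread drives all handlers; nothing to contend on
}
```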

This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?

ASIO is fine for small to medium tasks, but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows, but if you are experienced you will always do better than ASIO. Either way, there is a lot of overhead with all of those methods, just more with ASIO.

For what it is worth, using raw socket calls, my custom HTTP server can serve 800K dynamic requests per second on a 4-core i7. It is serving from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second; its performance is quite variable and mostly bound in my app. The post by @cmeerw mostly explains why.

One way to improve performance is by implementing a UDP proxy. By intercepting HTTP requests and then routing them over UDP to your backend UDP-HTTP server, you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of an HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact; you just need a couple more servers to implement it.

This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. FWIW, you should always have a frontend-backend setup for any performant site, because the frontends can cache data without slowing down the more important dynamic-requests backend.
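For illustration only, here is a deliberately simplified, blocking sketch of the front-end side of that idea. The backend address, the ports, and the one-datagram-per-request framing are assumptions made for the example; a real proxy would need proper HTTP framing, retransmission (UDP is unreliable), and concurrency.

```cpp
#include <boost/asio.hpp>
#include <array>

namespace asio = boost::asio;
using asio::ip::tcp;
using asio::ip::udp;

int main() {
    asio::io_service io;
    // Accept HTTP clients on TCP; talk to the backend over UDP.
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 8080));
    udp::socket backend(io, udp::endpoint(udp::v4(), 0));
    udp::endpoint backend_addr(asio::ip::address::from_string("10.0.0.2"), 9000);

    std::array<char, 64 * 1024> buf;
    for (;;) {
        tcp::socket client(io);
        acceptor.accept(client);
        // Read one request from the client (real code must parse HTTP framing).
        std::size_t n = client.read_some(asio::buffer(buf));
        // Forward the request bytes to the backend as a single datagram.
        backend.send_to(asio::buffer(buf, n), backend_addr);
        // Relay the backend's single-datagram response back over TCP.
        udp::endpoint from;
        n = backend.receive_from(asio::buffer(buf), from);
        asio::write(client, asio::buffer(buf, n));
    }
}
```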

The future seems to be writing your own driver that implements its own network stack, so you can get as close to the requests as possible and implement your own protocol there. That probably isn't what most programmers want to hear, as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this, but you will need more servers; though if you are doing this many requests per second, you will usually need multiple network cards and multiple frontends to handle the bandwidth anyway, so having a couple of lightweight UDP proxies in there isn't that big a deal.

Hope some of this can be useful to you.

How many instances of io_service do you have? Boost.Asio has an example that creates an io_service per CPU and uses them in a round-robin manner.

You can still create four threads and assign one per CPU, but each thread can poll on its own io_service.
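A sketch of that pattern, modeled on the "HTTP Server 2" example from the Boost.Asio documentation (class and member names here are my own): each io_service runs on its own thread, and each new connection is assigned to the next service round-robin, so one connection's handlers never contend with another thread's reactor.

```cpp
#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

class io_service_pool {
public:
    explicit io_service_pool(std::size_t n) : next_(0) {
        for (std::size_t i = 0; i < n; ++i) {
            services_.push_back(std::make_unique<boost::asio::io_service>());
            // The work object keeps run() from returning while the service is idle.
            work_.push_back(std::make_unique<boost::asio::io_service::work>(*services_[i]));
        }
    }

    // Run every io_service on its own thread and block until they all stop.
    void run() {
        std::vector<std::thread> threads;
        for (auto& s : services_)
            threads.emplace_back([&s] { s->run(); });
        for (auto& t : threads) t.join();
    }

    // Hand out services round-robin, one per accepted connection.
    boost::asio::io_service& get() {
        boost::asio::io_service& s = *services_[next_];
        next_ = (next_ + 1) % services_.size();
        return s;
    }

private:
    std::vector<std::unique_ptr<boost::asio::io_service>> services_;
    std::vector<std::unique_ptr<boost::asio::io_service::work>> work_;
    std::size_t next_;
};

// Usage sketch: io_service_pool pool(std::thread::hardware_concurrency());
// create each accepted socket on pool.get(), then call pool.run().
```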

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow