Question

These two libraries share a similar philosophy and, as a result, similar design decisions. But this popular WSGI benchmark says eventlet is way slower than gevent. What makes their performance so different?

As far as I know, the key differences between them are:

  • gevent intentionally depends on and is coupled to libev (previously libevent), while eventlet defines an independent reactor interface and implements particular adapters for it using select, epoll, and the Twisted reactor. Does the additional reactor interface cause a critical performance hit?

  • gevent is mostly written in Cython while eventlet is written in pure Python. Is natively compiled Cython that much faster than pure Python for programs that are not computation-heavy but IO-bound?

  • gevent's primitives emulate the standard library's interfaces, while eventlet's primitives differ from the standard and provide an additional layer to emulate it. Does the additional emulation layer make eventlet slower?

  • Is the implementation of eventlet.wsgi just worse than that of gevent.pywsgi?

I really wonder, because overall they look so similar to me.

Solution

Well, gevent is not "mostly" written in Cython, though some critical sections are.

Cython makes a huge difference. Processor optimizations work much better with compiled code: branch prediction, for example, falls apart in VM-based systems because the indirection of branching at the VM-execution level is opaque to the hardware, and the cache footprint is much tighter. Compiled code wins big here, and IO can be very sensitive to latency.

In a similar vein, libev is very fast. Same reasons.

It doesn't seem like eventlet should have been using the select hub (Python 2.6 usually defaults to epoll). If it was stuck on select, though, that would make it really slow (because Python has to convert the select fd_set back and forth to a Python list, so it gets ugly when it's in the middle of a loop).
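
If in doubt, it is easy to check which hub eventlet actually picked before trusting benchmark numbers. A minimal sketch, assuming a reasonably recent eventlet that exposes `eventlet.hubs.get_hub()` and `use_hub()`:

```python
# Sketch: report which hub eventlet actually selected on this machine.
import eventlet.hubs

hub = eventlet.hubs.get_hub()
print(type(hub))  # e.g. an epoll-based hub on Linux, a select-based one elsewhere

# If it reports the select hub on a platform that has epoll, a different one can
# be forced before any IO happens, e.g. eventlet.hubs.use_hub('poll'), or via
# the EVENTLET_HUB environment variable.
```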

I haven't done any profiling, but I'd be willing to bet that libev / libevent plus Cython makes the big difference. Notably, some of the threading primitives are in Cython in gevent. This is a big deal because a lot of code touches them indirectly through IO and even the standard library in some spots.

As for the additional emulation layer of eventlet, there does appear to be a lot more bounciness. In gevent, the code path seems to construct callbacks and let the hub call them. eventlet appears to do more of the bookkeeping that the hub is doing in gevent. Again, though, I haven't profiled it. As for the monkeypatching itself, they look fairly similar.

The WSGI server is another difficult one. Notably, the header parsing in gevent is deferred to the standard library, whereas they implement it themselves in eventlet. Not sure if this is a big impact or not, but it would be no surprise if there was something lurking in there. Most telling is that eventlet's server is based on a monkeypatched version of the standard library BaseHTTPServer. I can't imagine that this is very optimal. Gevent implements a server that is aware of the emulation.
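
For reference, a WSGI benchmark of this kind boils down to serving a trivial app with both servers. A minimal sketch (the app body and port numbers are arbitrary, not taken from the benchmark):

```python
# Minimal WSGI app of the kind such benchmarks serve.
def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello, world!\n']

# gevent: a server written with its own event loop in mind.
#   from gevent.pywsgi import WSGIServer
#   WSGIServer(('', 8088), app).serve_forever()

# eventlet: a server built on top of the (monkeypatched) stdlib machinery.
#   import eventlet
#   from eventlet import wsgi
#   wsgi.server(eventlet.listen(('', 8090)), app)
```

Running the same app under both servers and pointing the same load generator at each is essentially what the benchmark does.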

OTHER TIPS

Sorry for the late reply.

There are two main reasons for the big performance difference in that benchmark:

  • as stated before, gevent's critical paths are heavily optimized
  • that benchmark does stress testing. It's not IO-bound anymore, because it tries to make the machine run as many requests as possible, and that's where Cythonized code shines.

"In real world" that only happens during "slashdot" bursts of traffic. Which is important and one should be ready, but when it happens, you react by adding more servers or disabling resource heavy features. I haven't seen a benchmark that actually adds more servers when the load increases.

If, on the other hand, the benchmark simulated a "normal day" load (which varies from one website to another, but can generally be approximated as request, random pause, repeat), then the shorter the pause, the more traffic we would be simulating. The client side of the benchmark would also have to simulate latency. On Linux this can be done using the awesome netem [1]; otherwise, by putting small delays before recv/send calls (which would be very hard because benchmarks usually use higher-level libraries).
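
A sketch of such a "request, random pause, repeat" client loop (the URL, pause range and request count are placeholders; real latency simulation would still need netem or socket-level delays):

```python
# Sketch of a "normal day" load client: request, random pause, repeat.
# The shorter the pause, the more traffic we simulate.
import random
import time
import urllib.request

def normal_day_client(url='http://127.0.0.1:8088/', n_requests=100,
                      min_pause=0.05, max_pause=0.5):
    latencies = []
    for _ in range(n_requests):
        start = time.time()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        latencies.append(time.time() - start)
        time.sleep(random.uniform(min_pause, max_pause))
    return latencies

if __name__ == '__main__':
    print(sorted(normal_day_client()))
```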

Now if those conditions were met, we would actually be benchmarking IO-bound problems. But the results wouldn't be too exciting: all candidates successfully serve 10, 50 and even 200 qps loads. Boring, right? So we could measure the latency distribution, the time to serve 99% of requests, etc. gevent would still show better results, but the difference would be hardly impressive.
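
Turning recorded per-request latencies into that kind of figure is straightforward; a sketch using the nearest-rank method (the sample data is made up):

```python
# Sketch: report median and p99 from recorded per-request latencies (seconds).
import math

def percentile(sorted_samples, pct):
    # Nearest-rank method: the ceil(pct/100 * N)-th smallest value (1-based).
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_samples)))
    return sorted_samples[rank - 1]

latencies = sorted([0.012, 0.015, 0.011, 0.250, 0.013, 0.014, 0.016])  # made-up data
print(f'median: {percentile(latencies, 50):.3f}s  p99: {percentile(latencies, 99):.3f}s')
```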

[1] Simulate delayed and dropped packets on Linux

Licensed under: CC-BY-SA with attribution