Why is a pthread mutex considered “slower” than a futex?

https://stackoverflow.com/questions/6364314

28-10-2019
|

Question

Why are POSIX mutexes considered heavier or slower than futexes? Where is the overhead coming from in the pthread mutex type? I've heard that pthread mutexes are based on futexes, and when uncontested, do not make any calls into the kernel. It seems then that a pthread mutex is merely a "wrapper" around a futex.

Is the overhead simply in the function-wrapper call and the need for the mutex function to "setup" the futex (i.e., basically the setup of the stack for the pthread mutex function call)? Or are there some extra memory barrier steps taking place with the pthread mutex?

Solution

Because they stay in userspace as much as possible, which means they require fewer system calls, which is inherently faster because the context switch between user and kernel mode is expensive.

I assume you're talking about kernel threads when you talk about POSIX threads. It's entirely possible to have an entirely userspace implementation of POSIX threads which require no system calls but have other issues of their own.

My understanding is that a futex is halfway between a kernel POSIX thread and a userspace POSIX thread.

OTHER TIPS

Futexes were created to improve the performance of pthread mutexes. NPTL uses futexes, LinuxThreads predated futexes, which I think is where the "slower" consideration comes. NPTL mutexes may have some additional overhead, but it shouldn't be much.

Edit: The actual overhead basically consists on:

selecting the correct algorithm for the mutex type (normal, recursive, adaptive, error-checking; normal, robust, priority-inheritance, priority-protected), where the code heavily hints to the compiler that we are likely using a normal mutex (so it should convey that to the CPU's branch prediction logic),
and a write of the current owner of the mutex if we manage to take it which should normally be fast, since it resides in the same cache-line as the actual lock which we have just taken, unless the lock is heavily contended and some other CPU accessed the lock between the time we took it and when we attempted to write the owner (this write is unneeded for normal mutexes, but needed for error-checking and recursive mutexes).

So, a few cycles (typical case) to a few cycles + a branch misprediction + an additional cache miss (very worst case).

The short answer to your question is that futexes are known to be implemented about as efficiently as possible, while a pthread mutex may or may not be. At minimum, a pthread mutex has overhead associated with determining the type of mutex and futexes do not. So a futex will almost always be at least as efficient as a pthread mutex, until and unless someone thinks up some structure lighter than a futex and then releases a pthreads implementation that uses that for its default mutex.

Technically speaking pthread mutexes are not slower or faster than futexes. pthread is just a standard API, so whether they are slow or fast depends on the implementation of that API.

Specifically in Linux pthread mutexes are implemented as futexes and are therefore fast. Actually, you don't want to use the futex API itself as it is very hard to use, does not have the appropriate wrapper functions in glibc and requires coding in assembly which would be non portable. Fortunately for us the glibc maintainers already coded all of this for us under the hood of the pthread mutex API.

Now, because most operating systems did not implement futexes then programmers usually mean by pthread mutex is the performance you get from usual implementation of pthread mutexes, which is, slower.

So it's a statistical fact that in most operating systems that are POSIX compliant the pthread mutex is implemented in kernel space and is slower than a futex. In Linux they have the same performance. It could be that there are other operating systems where pthread mutexes are implemented in user space (in the uncontended case) and therefore have better performance but I am only aware of Linux at this point.

On AMD64 a futex is 4 bytes, while a NPTL pthread_mutex_t is 56 bytes! Yes, there is a significant overhead.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow