Question

I read somewhere (can't find the page anymore) that lock-free data structures are more efficient "for certain workloads", which seems to imply that sometimes they're actually slower, or that the gain from them can be zero in some situations. Taking the ~100-cycle hit of a lock instruction to do an atomic op sounds plenty faster to me than going to sleep and waiting for the scheduler to wake the process back up, so it's not obvious to me under what circumstances a lock-free data structure would be less preferable than old-fashioned mutexes. If the lock is available 99% of the time and the process doesn't have to go to sleep, is a mutex then faster? Is there a good rule of thumb for knowing which way to go, assuming a suitable lock-free data structure is available?


Solution

A common approach to implementing a lock-free data structure is to have a mutable reference to an immutable object, and have anything that wants to change the structure grab the reference, produce a new version of the object with suitable changes applied, and then CompareExchange the reference to point to the new object. If the CompareExchange works, great. If not, ditch the new object, re-grab the reference, and start over.
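For illustration, here is a minimal sketch of that pattern in Java, using AtomicReference.compareAndSet as the CompareExchange (the class name CasStack and the list-based representation are hypothetical; the answer doesn't prescribe any particular structure):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the copy-and-CAS pattern described above:
// a mutable reference to an immutable snapshot, updated by
// copy, modify, and compare-and-set. Names are illustrative.
class CasStack<T> {
    private final AtomicReference<List<T>> ref =
            new AtomicReference<>(Collections.emptyList());

    void push(T item) {
        while (true) {
            List<T> current = ref.get();                // grab the reference
            List<T> updated = new ArrayList<>(current); // produce a new version
            updated.add(item);
            // CompareExchange: succeeds only if no other thread changed ref meanwhile
            if (ref.compareAndSet(current, Collections.unmodifiableList(updated))) {
                return;
            }
            // Otherwise ditch the new object, re-grab the reference, and start over.
        }
    }
}
```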

This can work well if producing the new object is cheap and the level of contention is low enough that the CompareExchange will usually work. If there is considerable contention, and if producing the new object is slow, simultaneous attempted updates by N threads may take N^2 time to complete. As an extreme example, suppose 100 threads are running on a CPU, an update takes 100ms of CPU time (just over a time-slice), and all 100 threads attempt to update an object at once. During the first ten seconds, each thread will produce a new object based on the original one. One of the threads will successfully do the CompareExchange, while the others will all fail. Then during the next 9.9 seconds, 99 threads will generate new versions of the object, after which one will successfully post its update and 98 will fail. The net effect will be that the lock-free method will take 505 seconds' worth of CPU time to perform 100 updates, when a locking system could have done them in about 10 seconds.

OTHER TIPS

Lockless data structures will, one way or another, use atomic primitives from your architecture to perform their core operations. When you do this, you are using the machine's internal exclusion mechanisms to ensure correct ordering or fencing of data. A mutex or critical section also does this, but it does it only once, for a single flag. Where the mutex or critical section is slow is when the lock acquisition fails (there is contention). In that case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released.

So it seems logical that whenever your lock-less data structure needs multiple atomic operations per core method, where a single lock shielding a critical section would provide the same semantics, and the structure in question sees very little contention in practice, it makes more sense to use an OS-provided locking mechanism than to try to build your own, as in the sketch below.
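As a hedged illustration of that point, consider a method that must update two fields consistently. A lock-free version would have to coordinate multiple atomic operations; under low contention, one ordinary lock does the job with a single acquire/release pair (the Stats class here is hypothetical):

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: one lock shields a critical section that updates two
// fields together; readers never observe one update without the other.
class Stats {
    private final ReentrantLock lock = new ReentrantLock();
    private long count;
    private long sum;

    void record(long value) {
        lock.lock();
        try {
            count++;          // both updates become visible atomically
            sum += value;
        } finally {
            lock.unlock();
        }
    }

    double mean() {
        lock.lock();
        try {
            return count == 0 ? 0.0 : (double) sum / count;
        } finally {
            lock.unlock();
        }
    }
}
```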

I don't know about making it slower, but it certainly makes it harder to get right. In the many cases where the two approaches are virtually identical in performance (or when it simply doesn't matter whether it takes 500 picoseconds rather than 100 picoseconds), pick the simplest approach: generally, a lock.

There are very few cases when that extra bit of performance is key; and if it is, I suspect you'd do well to use the pre-rolled pattern implementations from established libraries. Getting lock-free code working properly (and proving that it works properly in all conditions) is often very hard.

Note also that some environments offer a level of locking above the OS-provided mutex; mutex behaviour, but without some of the overheads (for example, Monitor in .NET).
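Java's intrinsic monitors (the synchronized keyword) are a comparable facility: on modern JVMs an uncontended monitor is typically acquired with a cheap atomic operation in user space, and the runtime falls back to OS-level blocking only under contention. A trivial sketch:

```java
// Uncontended entry to a synchronized method is usually just an
// atomic operation on the object header; the OS gets involved only
// when another thread already holds the monitor.
class Counter {
    private long value;

    synchronized void increment() { value++; }
    synchronized long get()       { return value; }
}
```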

I would like to add one point to this part of the answer: "Where the mutex or critical section is slow is when the lock acquisition fails (there is contention). In that case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released."

It seems that different operating systems can take different approaches when lock acquisition fails. I use HP-UX, which, for example, has a more sophisticated approach to locking mutexes. Here is its description:

... On the other hand, changing context is an expensive process. If the wait is going to be a short one, we'd rather not do the context switch. To balance out these requirements, when we try to get a semaphore and find it locked, the first thing we do is a short spin wait. The routine psema_spin_1() is called to spin for up to 50,000 clock cycles trying to get the lock. If we fail to get the lock after 50,000 cycles, we then call psema_switch_1() to give up the processor and let another process take over.
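That spin-then-block strategy is easy to sketch in user space. Below is a hypothetical Java version (SpinThenYieldLock and the SPIN_LIMIT constant are illustrative stand-ins for HP-UX's psema_spin_1()/psema_switch_1() pair and its 50,000-cycle budget):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical spin-then-block lock: spin briefly in the hope the
// holder releases soon, then give up the processor instead of burning CPU.
final class SpinThenYieldLock {
    private final AtomicBoolean locked = new AtomicBoolean();
    private static final int SPIN_LIMIT = 1_000; // stand-in for the 50,000-cycle budget

    void lock() {
        int spins = 0;
        while (!locked.compareAndSet(false, true)) {
            if (++spins < SPIN_LIMIT) {
                Thread.onSpinWait();  // busy-wait hint to the CPU (JDK 9+)
            } else {
                Thread.yield();       // let another thread run, like psema_switch_1()
                spins = 0;
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}
```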

Keep in mind that a mutex may well be implemented as a lock-free data structure, in the sense that it uses one or a few atomic objects to represent its state. It's a false dichotomy.

Better is to consider whether you need to allow multiple threads to wait for access to some set of operations or to block until signaled. Each requires a queue of waiting threads. The former queues threads waiting for access to the synchronized area, while the latter queues threads waiting for a signal. The Java classes AbstractQueuedSynchronizer and AbstractQueuedLongSynchronizer provide such a queue—in particular, a CLH Queue—upon which one can build mutexes, conditions, and other queue-based primitives.
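For instance, the AbstractQueuedSynchronizer documentation shows how a simple non-reentrant mutex can be built on that queue; the sketch below follows that pattern (SimpleMutex is an illustrative name):

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

// A non-reentrant mutex built on AQS: state 0 = unlocked, 1 = locked.
// Contending threads park on the CLH queue that AQS maintains.
class SimpleMutex {
    private static class Sync extends AbstractQueuedSynchronizer {
        @Override
        protected boolean tryAcquire(int unused) {
            return compareAndSetState(0, 1);
        }

        @Override
        protected boolean tryRelease(int unused) {
            setState(0);
            return true;
        }
    }

    private final Sync sync = new Sync();

    public void lock()   { sync.acquire(1); }
    public void unlock() { sync.release(1); }
}
```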

If your requirements instead favor only one thread taking on an exclusive set of work, while other threads remain free to carry on with other work rather than waiting until they too can do that same work, then lock-free techniques are possible. Whether they will grant faster run times is a matter for benchmarking; the answer depends on how often and how many threads contend over these synchronization controls, and on whether there is other work for threads to perform independently.

Efficiency depends on the metric. Lock-free and wait-free algorithms are important in systems where preemption can introduce deadlock or jeopardize scheduling deadlines. In those cases, raw processing speed is less important than correctness.

The OP considers lock-free operations as an alternative to mutexes. Some algorithms require neither to access a shared data structure. In these cases, both producer and consumer can access the same data structure concurrently without regard for the other. One example is a shared queue that permits a single reader and a single writer to act simultaneously on a shared instance. This meets the common need of a device driver writing data that a consumer process can access on demand.
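A minimal sketch of such a queue in Java, assuming exactly one producer thread and one consumer thread (the class SpscQueue is hypothetical; production implementations add cache-line padding and weaker memory-ordering refinements):

```java
import java.util.concurrent.atomic.AtomicLong;

// Single-producer/single-consumer ring buffer: no mutex, and no
// compare-and-swap retry loops. Correct only if exactly one thread
// calls offer() and exactly one thread calls poll().
final class SpscQueue<T> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscQueue(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        buffer = new Object[capacity];
        mask = capacity - 1;
    }

    boolean offer(T item) {                 // producer thread only
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = item;
        tail.set(t + 1);                    // volatile write publishes the element
        return true;
    }

    @SuppressWarnings("unchecked")
    T poll() {                              // consumer thread only
        long h = head.get();
        if (h == tail.get()) return null;   // empty
        int i = (int) (h & mask);
        T item = (T) buffer[i];
        buffer[i] = null;                   // help GC
        head.set(h + 1);                    // volatile write frees the slot
        return item;
    }
}
```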

More complex relationships between processes can be permitted, with varying levels of hardware support (see Herlihy (1991) for an analysis). He concludes that "wait-free synchronization represents a qualitative break with the traditional locking-based techniques for implementing concurrent objects."

What this means is that there remains a trade-off, but it is not simply a choice between mutexes and spinlocks.

A rule of thumb remains to focus on correctness rather than performance. Performance can usually be achieved by throwing money at the problem, while meeting requirements is usually more difficult.
