Question

I've been reading about lock-free techniques, like Compare-and-swap and leveraging the Interlocked and SpinWait classes to achieve thread synchronization without locking.

I've run a few tests of my own, where I simply have many threads trying to append a character to a string. I tried using regular locks and compare-and-swap. Surprisingly (at least to me), locks showed much better results than CAS.

Here's the CAS version of my code (based on this). It follows a copy->modify->swap pattern:

    private string _str = "";
    public void Append(char value)
    {
        var spin = new SpinWait();
        while (true)
        {
            var original = Interlocked.CompareExchange(ref _str, null, null); // atomic read of _str (the null-for-null swap is a no-op)

            var newString = original + value;
            if (Interlocked.CompareExchange(ref _str, newString, original) == original)
                break;          // swap succeeded
            spin.SpinOnce();    // another thread changed _str first; back off and retry
        }
    }

And the simpler (and more efficient) lock version:

    private object lk = new object();
    public void AppendLock(char value)
    {
        lock (lk)
        {
            _str += value;
        }
    }

If I try adding 50,000 characters, the CAS version takes 1.2 seconds and the lock version 700 ms on average. For 100k characters, they take 7 seconds and 3.8 seconds, respectively. This was run on a quad-core (i5 2500K).

I suspected the reason CAS was showing these results was that it was failing the final "swap" step a lot, and I was right. When adding 50k chars (50k successful swaps), I counted between 70k (best case) and almost 200k (worst case) failed attempts. In the worst case, 4 out of every 5 attempts failed.

So my questions are:

  1. What am I missing? Shouldn't CAS give better results? Where's the benefit?
  2. Why exactly and when is CAS a better option? (I know this has been asked, but I can't find any satisfying answer that also explains my specific scenario).

It is my understanding that solutions employing CAS, although hard to code, scale much better and perform better than locks as contention increases. In my example, the operations are very small and frequent, which means high contention and high frequency. So why do my tests show otherwise?

I assume that longer operations would make things even worse, since the "swap" failure rate would increase even more.

PS: this is the code I used to run the tests:

    Stopwatch watch = Stopwatch.StartNew();
    var cl = new Class1();
    Parallel.For(0, 50000, i => cl.Append('a'));

    var time = watch.Elapsed;
    Debug.WriteLine(time.TotalMilliseconds);

Solution

The problem is a combination of the failure rate on the loop and the fact that strings are immutable. I did a couple of tests on my own using the following parameters.

  • Ran 8 different threads (I have an 8 core machine).
  • Each thread called Append 10,000 times.

What I observed was that the final length of the string was 80,000 (8 x 10,000), so that was perfect. The number of append attempts averaged ~300,000 for me. That is a failure rate of ~73%; only 27% of the CPU time resulted in useful work.

Now, because strings are immutable, each append creates a new instance of the string on the heap and copies the original contents plus the one extra character into it. This copy operation is O(n), so it gets longer and longer as the string grows. Because of the copy operation, my hypothesis was that the failure rate would increase as the length of the string increases: as the copy takes more and more time, there is a higher probability of a collision because the threads spend more time competing to finalize the ICX (the interlocked compare-exchange). My tests confirmed this. You should try this same test yourself.
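
A measurement along these lines could be set up roughly as in the sketch below; the attempt counter, class name, and harness structure here are illustrative rather than the exact code used.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class CasFailureRateTest
    {
        private string _str = "";
        private long _attempts;

        public void Append(char value)
        {
            var spin = new SpinWait();
            while (true)
            {
                Interlocked.Increment(ref _attempts);        // count every attempt, successful or not
                var original = _str;
                var newString = original + value;
                if (Interlocked.CompareExchange(ref _str, newString, original) == original)
                    break;                                   // swap succeeded
                spin.SpinOnce();                             // lost the race; back off and retry
            }
        }

        static void Main()
        {
            var test = new CasFailureRateTest();
            var tasks = new Task[8];
            for (int t = 0; t < tasks.Length; t++)
                tasks[t] = Task.Run(() =>
                {
                    for (int i = 0; i < 10000; i++)
                        test.Append('a');
                });
            Task.WaitAll(tasks);

            long successes = test._str.Length;               // should be 80,000
            long failures = test._attempts - successes;
            Console.WriteLine($"attempts: {test._attempts}, failed: {failures}");
        }
    }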

The biggest issue here is that sequential string concatenations do not lend themselves to parallelism very well. Since the result of operation Xn depends on Xn-1, it is going to be faster to take the full lock, especially if it means you avoid all of the failures and retries. A pessimistic strategy wins the battle against an optimistic one in this case. The low lock techniques work better when you can partition the problem into independent chunks that really can run unimpeded in parallel.
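
For this particular workload, the partitioning idea could look something like the following sketch (the worker count and buffer size are just example values, not from the tests above): each worker fills its own slice of a preallocated buffer with no shared state, and the string is built once at the end.

    using System;
    using System.Threading.Tasks;

    char[] buffer = new char[80000];                 // 8 workers x 10,000 characters each
    Parallel.For(0, 8, worker =>
    {
        int start = worker * 10000;
        for (int i = 0; i < 10000; i++)
            buffer[start + i] = 'a';                 // each worker owns its slice: no contention, no retries
    });
    string result = new string(buffer);              // the string is materialized once, with a single copy
    Console.WriteLine(result.Length);                // 80000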

As a side note, the use of Interlocked.CompareExchange to do the initial read of _str is unnecessary. The reason is that a memory barrier is not required for the read in this case. This is because the Interlocked.CompareExchange call that actually performs work (the second one in your code) will create a full barrier. So the worst case scenario is that the first read is "stale", the ICX operation fails the test, and the loop spins back around to try again. This time, however, the previous ICX forced a "fresh" read.¹
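
A sketch of the same loop with the redundant read removed (a plain read of _str; everything else unchanged):

    public void Append(char value)
    {
        var spin = new SpinWait();
        while (true)
        {
            var original = _str;                 // plain read; a stale value only costs one extra retry
            var newString = original + value;
            if (Interlocked.CompareExchange(ref _str, newString, original) == original)
                break;                           // this ICX is a full barrier, so a retry sees a fresh value
            spin.SpinOnce();
        }
    }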

The following code is how I generalize a complex operation using low lock mechanisms. In fact, the code presented below lets you pass a delegate representing the operation, so it is very general. Would you want to use it in production? Probably not, because invoking the delegate is slow, but at least you get the idea. You could always hard-code the operation.

    public static class InterlockedEx
    {
        public static T Change<T>(ref T destination, Func<T, T> operation) where T : class
        {
            T original, value;
            do
            {
                original = destination;          // read the current value
                value = operation(original);     // compute the replacement
            }
            while (Interlocked.CompareExchange(ref destination, value, original) != original);
            return original;
        }
    }
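
Hooked up to the original example, Append would then reduce to something like this (assuming the InterlockedEx class above is in scope):

    public void Append(char value)
    {
        InterlockedEx.Change(ref _str, s => s + value);
    }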

¹ I actually dislike the terms "stale" and "fresh" when discussing memory barriers because that is not really what they are about. It is more of a side effect than an actual guarantee. But in this case it illustrates my point better.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow