Question

I've been reading about lock-free techniques, like Compare-and-swap and leveraging the Interlocked and SpinWait classes to achieve thread synchronization without locking.

I've run a few tests of my own, where I simply have many threads trying to append a character to a string. I tried using regular locks and compare-and-swap. Surprisingly (at least to me), locks showed much better results than CAS.

Here's the CAS version of my code (based on this). It follows a copy->modify->swap pattern:

    private string _str = "";
    public void Append(char value)
    {
        var spin = new SpinWait();
        while (true)
        {
            var original = Interlocked.CompareExchange(ref _str, null, null); // atomic read of _str (the null-for-null swap is a no-op)

            var newString = original + value;
            if (Interlocked.CompareExchange(ref _str, newString, original) == original)
                break;          // swap succeeded
            spin.SpinOnce();    // another thread changed _str first; back off and retry
        }
    }

And the simpler (and more efficient) lock version:

    private object lk = new object();
    public void AppendLock(char value)
    {
        lock (lk)
        {
            _str += value;
        }
    }

If I try adding 50,000 characters, the CAS version takes 1.2 seconds and the lock version 700 ms on average. For 100k characters, they take 7 seconds and 3.8 seconds, respectively. This was run on a quad-core (i5 2500K).

I suspected the reason CAS was showing these results was that it was failing the final "swap" step a lot, and I was right. When adding 50k chars (50k successful swaps), I counted between 70k (best case) and almost 200k (worst case) failed attempts. In the worst case, 4 out of every 5 attempts failed.

So my questions are:

  1. What am I missing? Shouldn't CAS give better results? Where's the benefit?
  2. Why exactly and when is CAS a better option? (I know this has been asked, but I can't find any satisfying answer that also explains my specific scenario).

It is my understanding that solutions employing CAS, although hard to code, scale much better and perform better than locks as contention increases. In my example, the operations are very small and frequent, which means high contention and high frequency. So why do my tests show otherwise?

I assume that longer operations would make things even worse, since the "swap" failure rate would increase even more.

PS: this is the code I used to run the tests:

    Stopwatch watch = Stopwatch.StartNew();
    var cl = new Class1();
    Parallel.For(0, 50000, i => cl.Append('a'));

    var time = watch.Elapsed;
    Debug.WriteLine(time.TotalMilliseconds);

Solution

The problem is a combination of the failure rate on the loop and the fact that strings are immutable. I did a couple of tests on my own using the following parameters.

  • Ran 8 different threads (I have an 8 core machine).
  • Each thread called Append 10,000 times.

What I observed was that the final length of the string was 80,000 (8 x 10,000), so that was perfect. The number of append attempts averaged ~300,000 for me. That is a failure rate of ~73%; only 27% of the CPU time resulted in useful work.

Now, because strings are immutable, each append creates a new instance of the string on the heap and copies the original contents plus the one extra character into it. This copy operation is O(n), so it gets longer and longer as the string grows. Because of the copy operation, my hypothesis was that the failure rate would increase as the length of the string increases: as the copy takes more and more time, there is a higher probability of a collision because the threads spend more time competing to finalize the ICX (the interlocked compare-exchange). My tests confirmed this. You should try this same test yourself.
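
A measurement along these lines could be set up roughly as in the sketch below; the attempt counter, class name, and harness structure here are illustrative rather than the exact code used.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class CasFailureRateTest
    {
        private string _str = "";
        private long _attempts;

        public void Append(char value)
        {
            var spin = new SpinWait();
            while (true)
            {
                Interlocked.Increment(ref _attempts);        // count every attempt, successful or not
                var original = _str;
                var newString = original + value;
                if (Interlocked.CompareExchange(ref _str, newString, original) == original)
                    break;                                   // swap succeeded
                spin.SpinOnce();                             // lost the race; back off and retry
            }
        }

        static void Main()
        {
            var test = new CasFailureRateTest();
            var tasks = new Task[8];
            for (int t = 0; t < tasks.Length; t++)
                tasks[t] = Task.Run(() =>
                {
                    for (int i = 0; i < 10000; i++)
                        test.Append('a');
                });
            Task.WaitAll(tasks);

            long successes = test._str.Length;               // should be 80,000
            long failures = test._attempts - successes;
            Console.WriteLine($"attempts: {test._attempts}, failed: {failures}");
        }
    }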

The biggest issue here is that sequential string concatenations do not lend themselves to parallelism very well. Since the result of operation Xn depends on Xn-1, it is going to be faster to take the full lock, especially if it means you avoid all of the failures and retries. A pessimistic strategy wins the battle against an optimistic one in this case. The low lock techniques work better when you can partition the problem into independent chunks that really can run unimpeded in parallel.
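
For this particular workload, the partitioning idea could look something like the following sketch (the worker count and buffer size are just example values, not from the tests above): each worker fills its own slice of a preallocated buffer with no shared state, and the string is built once at the end.

    using System;
    using System.Threading.Tasks;

    char[] buffer = new char[80000];                 // 8 workers x 10,000 characters each
    Parallel.For(0, 8, worker =>
    {
        int start = worker * 10000;
        for (int i = 0; i < 10000; i++)
            buffer[start + i] = 'a';                 // each worker owns its slice: no contention, no retries
    });
    string result = new string(buffer);              // the string is materialized once, with a single copy
    Console.WriteLine(result.Length);                // 80000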

As a side note, the use of Interlocked.CompareExchange to do the initial read of _str is unnecessary. The reason is that a memory barrier is not required for the read in this case. This is because the Interlocked.CompareExchange call that actually performs work (the second one in your code) will create a full barrier. So the worst case scenario is that the first read is "stale", the ICX operation fails the test, and the loop spins back around to try again. This time, however, the previous ICX forced a "fresh" read.¹
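
A sketch of the same loop with the redundant read removed (a plain read of _str; everything else unchanged):

    public void Append(char value)
    {
        var spin = new SpinWait();
        while (true)
        {
            var original = _str;                 // plain read; a stale value only costs one extra retry
            var newString = original + value;
            if (Interlocked.CompareExchange(ref _str, newString, original) == original)
                break;                           // this ICX is a full barrier, so a retry sees a fresh value
            spin.SpinOnce();
        }
    }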

The following code is how I generalize a complex operation using low lock mechanisms. In fact, the code presented below lets you pass a delegate representing the operation, so it is very general. Would you want to use it in production? Probably not, because invoking the delegate is slow, but at least you get the idea. You could always hard-code the operation.

    public static class InterlockedEx
    {
        public static T Change<T>(ref T destination, Func<T, T> operation) where T : class
        {
            T original, value;
            do
            {
                original = destination;          // read the current value
                value = operation(original);     // compute the replacement
            }
            while (Interlocked.CompareExchange(ref destination, value, original) != original);
            return original;
        }
    }
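
Hooked up to the original example, Append would then reduce to something like this (assuming the InterlockedEx class above is in scope):

    public void Append(char value)
    {
        InterlockedEx.Change(ref _str, s => s + value);
    }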

¹ I actually dislike the terms "stale" and "fresh" when discussing memory barriers because that is not really what they are about. It is more of a side effect than an actual guarantee. But in this case it illustrates my point better.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow