Question

At the end of http://blog.memsql.com/common-pitfalls-in-writing-lock-free-algorithms/, David Stolp shows performance data for a lockfree stack showing that the lockfree version is slower than a sequential version protected by a mutex, even as contention increases. The results are interesting, but two things concern me:

  • At best, "overall throughput" decreases and then levels out as the number of threads increases. Shouldn't overall throughput increase as the number of threads increases?
  • In the final chart, the performance values for 1 thread range from about 35M to 55M. This seems like an awfully wide range for 1 thread (where there can't be any contention).

I've tried to find a way to contact the author about these issues, but I've had no success, so I'm turning to the SO community to see if the results seem realistic. Do they?

Solution

At best, "overall throughput" decreases and then levels out as the number of threads increases. Shouldn't overall throughput increase as the number of threads increases?

This is normal. There is only one stack, so the bottleneck is memory synchronization between threads, not code execution. The more threads push to and pop from the stack, the more memory conflicts occur (more CAS failures and cache-line contention), so throughput decreases.
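To see why every thread serializes on the same memory location, here is a minimal Treiber-style lock-free stack sketch (my own illustration, not the blog's code, and it deliberately ignores the ABA problem and memory reclamation that the blog post discusses). Every operation funnels through a single atomic `head` pointer, so concurrent threads fight over one cache line and failed CASes must retry:

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

struct LockFreeStack {
    std::atomic<Node*> head{nullptr};

    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        // If another thread changed head in the meantime, the CAS fails,
        // n->next is reloaded, and we retry: this retry traffic is the
        // "memory conflict" that grows with the thread count.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    Node* pop() {
        Node* top = head.load(std::memory_order_acquire);
        // Same retry loop on the same cache line; returns nullptr if empty.
        // (A real implementation must also handle ABA, e.g. with a tagged
        // pointer, which is what the 16-byte CAS below is for.)
        while (top &&
               !head.compare_exchange_weak(top, top->next,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
        }
        return top;
    }
};
```

Adding threads adds contenders for that one `head` word, which is why throughput levels off instead of scaling.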

In the final chart, the performance values for 1 thread range from about 35M to 55M. This seems like an awfully wide range for 1 thread (where there can't be any contention).

This test code is not realistic, because it does nothing but push and pop nodes on the stack. With such a short hot path, even a tiny difference in the code can dramatically affect throughput.

If you check the code, you can see that the SpinLock version is very simple and can be faster than the LockFree version, because it is built on the atomic test_and_set operation, which is normally faster than the atomic CAS (compare-and-swap) used in the LockFree version.
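For reference, a test-and-set spinlock can be this small. The sketch below is my own assumption of what the blog's SpinLock roughly looks like (using `std::atomic_flag`); the point is that acquiring it is a single atomic exchange rather than a compare-and-swap:

```cpp
#include <atomic>

// Minimal test-and-set spinlock sketch. test_and_set atomically sets the
// flag and returns its previous value, so we spin until it was clear.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy-wait; a real spinlock would typically add a pause hint
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
```

On x86 this lock path compiles to one locked exchange instruction, with no retry loop in the uncontended case.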

EDIT:

The LockFree version uses cmpxchg16b, a 16-byte CAS, while the SpinLock version uses only an 8-byte test_and_set function (implemented with cmpxchg8b), so a speed difference is to be expected!

OTHER TIPS

At best, "overall throughput" decreases and then levels out as the number of threads increases. Shouldn't overall throughput increase as the number of threads increases?

Not necessarily, and definitely not in this case. Multithreading is beneficial if the threads of execution can run concurrently. It's better to make an application single threaded if the threads can never run concurrently or if almost all of the execution involves contention for one single resource.

This test was designed so that the threads cannot run concurrently: almost all of the code contends for a single resource, the stack. This nominally bad design was used intentionally, both to measure the overhead of the various concurrency schemes and to test how well each scheme copes with contention.

In the final chart, the performance values for 1 thread range from about 35M to 55M. This seems like an awfully wide range for 1 thread (where there can't be any contention).

Even without contention, there's a lot of overhead involved in locking and unlocking a mutex. There's a lot less overhead in compare-and-swap, even less in test-and-set. That pure overhead is what is being tested when there is only one thread of execution.
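A rough way to see that pure overhead is a single-thread micro-benchmark like the sketch below. This is my own illustration, not the article's harness; the function names and iteration counts are made up, and absolute numbers will vary by machine, but each function returns the elapsed nanoseconds for repeated uncontended acquire/release of one primitive:

```cpp
#include <atomic>
#include <chrono>
#include <mutex>

// Time `iters` repetitions of `body` in nanoseconds.
template <typename F>
long long time_ns(F&& body, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) body();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)
        .count();
}

// Uncontended mutex lock/unlock.
long long mutex_overhead_ns(int iters) {
    std::mutex m;
    return time_ns([&] { m.lock(); m.unlock(); }, iters);
}

// Uncontended test-and-set acquire/release.
long long tas_overhead_ns(int iters) {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    return time_ns([&] {
        f.test_and_set(std::memory_order_acquire);
        f.clear(std::memory_order_release);
    }, iters);
}

// Uncontended compare-and-swap that always succeeds.
long long cas_overhead_ns(int iters) {
    std::atomic<int> v{0};
    return time_ns([&] {
        int expected = 0;
        v.compare_exchange_strong(expected, 0);
    }, iters);
}
```

Because the loop body is so short, small differences between these primitives translate directly into large throughput differences, which is consistent with the wide 35M-55M spread in the single-thread results.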

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow