Parallel write to array slower than serial write using OmniThreadLibrary

Question 1

This happens because Random is not thread-safe. Its implementation is:

// global var
var
  RandSeed: Longint = 0;    { Base for random number generator }

function Random: Extended;
const
  two2neg32: double = ((1.0/$10000) / $10000);  // 2^-32
var
  Temp: Longint;
  F: Extended;
begin
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F  := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;

Because RandSeed is a global variable that every call to Random modifies, the threads end up making contended writes to RandSeed. Those contended writes cause your performance problem: they effectively serialize your parallel code, severely enough to make it slower than the true serial code.

Add the code below to the top of the implementation section of your unit and you'll see the difference:

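threadvar
  RandSeed: Longint;  // one seed per thread; starts at 0, so every thread produces the same sequence

function Random: Double;
const
  two2neg32: double = ((1.0/$10000) / $10000);  // 2^-32
var
  Temp: Longint;
  F: Double;
begin
  // Relies on 32-bit wraparound, so overflow checking must be off ({$Q-}, the default).
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;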

With that change, which avoids the shared, contended writes, you'll find that the parallel version is faster, as expected. You still won't get linear scaling with processor count. My guess is that this is because the pattern of memory access is sub-optimal in the parallel version of the code.

I'm guessing that you are only using Random as a means to generate some data. But if you do need an RNG, you'll want to arrange for each task to use its own private RNG instance.
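
As a minimal sketch of what a private RNG per task could look like, here is a small record that reuses the same linear congruential step as the RTL code above. TLocalRandom, Init and Next are hypothetical names chosen for illustration; they are not part of the RTL or OTL, and the sketch assumes a Delphi version that supports records with methods and UInt64.

type
  // Hypothetical per-task generator; not part of the RTL or OTL.
  TLocalRandom = record
  private
    FSeed: Cardinal;
  public
    procedure Init(ASeed: Cardinal);
    function Next: Double;  // uniform in [0, 1)
  end;

procedure TLocalRandom.Init(ASeed: Cardinal);
begin
  FSeed := ASeed;
end;

function TLocalRandom.Next: Double;
const
  Two2Neg32: Double = (1.0 / $10000) / $10000;  // 2^-32
begin
  // Same multiplier and increment as the Random shown above. The product is
  // formed in 64 bits and masked back to 32, so it does not depend on the
  // overflow-checking setting.
  FSeed := Cardinal((UInt64(FSeed) * UInt64($08088405) + 1) and $FFFFFFFF);
  Result := FSeed * Two2Neg32;
end;

Each task would declare its own TLocalRandom as a local variable, call Init with a different seed (for example one derived from the task or chunk index), and then call Next instead of Random. Nothing is shared between threads, so there is nothing to contend on. Nearby seeds give correlated streams with a generator this simple, so treat it as a way to produce test data rather than as a statistically sound parallel RNG.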

You can also speed up your code a little by using Sqr(X) rather than X*X, and by switching from Extended to Double.

Question 2

Some time ago I experienced exactly the same issue. The bottleneck turned out to be the hidden enumerator that OTL creates for Parallel.ForEach calls with a range. When the work per iteration is very small and the loop is called often, that enumerator dominates the run time.

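A more performant solution gives each task one contiguous chunk of the index range and runs an ordinary for loop inside it. It looked something like this: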

Parallel.ForEach(0, MAXCORES - 1)  // one iteration per chunk; the range is inclusive
    .NumTasks(MAXCORES)
    .Execute(
      procedure (const p: Integer)
      var
        chunkSize: Integer;
        myStart, myEnd: Integer;
        i: Integer;
      begin
        chunkSize := DIMENSION div MAXCORES;
        myStart := p * chunkSize;
        if p = MAXCORES - 1 then
          myEnd := DIMENSION - 1  // the last chunk also takes the remainder
        else
          myEnd := myStart + chunkSize - 1;
        for i := myStart to myEnd do
          DoSomething(i);
      end);

This code scaled up quite linearly regardless of the load within the DoSomething call.
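
The chunk arithmetic is easy to get subtly wrong when DIMENSION is not a multiple of MAXCORES, so it can help to check it in isolation. The sketch below is a plain console program (no OTL needed) built around a hypothetical helper, ChunkBounds, which is not part of OTL. It splits 0..Dimension-1 into Cores contiguous chunks whose sizes differ by at most one, so every index is covered exactly once.

program ChunkDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: bounds of chunk P (0-based) when 0..Dimension-1
// is split into Cores contiguous chunks whose sizes differ by at most 1.
procedure ChunkBounds(P, Cores, Dimension: Integer;
  out First, Last: Integer);
var
  Base, Extra: Integer;
begin
  Base := Dimension div Cores;   // minimum chunk size
  Extra := Dimension mod Cores;  // the first Extra chunks get one more item
  if P < Extra then
  begin
    First := P * (Base + 1);
    Last := First + Base;
  end
  else
  begin
    First := Extra * (Base + 1) + (P - Extra) * Base;
    Last := First + Base - 1;    // empty range (Last < First) when Base = 0
  end;
end;

var
  P, First, Last: Integer;
begin
  // 11 items over 4 chunks: prints 0: 0..2, 1: 3..5, 2: 6..8, 3: 9..10
  for P := 0 to 3 do
  begin
    ChunkBounds(P, 4, 11, First, Last);
    Writeln(P, ': ', First, '..', Last);
  end;
end.

Inside the ForEach body, the same helper could be called as ChunkBounds(p, MAXCORES, DIMENSION, myStart, myEnd), with the outer loop running over 0 to MAXCORES - 1. Spreading the remainder this way keeps the chunks balanced instead of leaving all leftover items to one task.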

Question 3

I've tried running this (with the Random fix and using Doubles) on an i7 (8 hyper-threads) and get 1650 ms for parallel and 5240 ms for serial, a speed-up of roughly 3.2. Given the code content, I don't find this scale-up particularly unexpected. The code as it stands will have close to 100% successful pipeline prediction: all branches predicted, function call returns cached, even cache prefetch working well. On a typical modern PC this means that the code is probably going to be memory-bandwidth limited, in which case the scale-up depends far more on your memory performance than on how many cores you have.

The only other issue is potential contention for FPU resources, which will be highly dependent on your internal processor architecture.

I suspect that with a more complex workload you would see a greater scale-up between serial and parallel, because the serial version would lose time to code-triggered pipeline breaks whilst the parallel version would remain memory limited. I've done a fair bit of high-performance computing work in Delphi, and well-optimised algorithms doing simple calculations can become totally memory bound, with multi-threaded scale-ups of as little as 2 on a good 8-core machine due to memory bandwidth limits. This sort of issue is illustrated particularly well if you can over-clock: the performance gain from over-clocking the CPU gives a very good indication of the level of memory waits, since everything else speeds up in proportion to the clock.

If you want to get into the details of processor architecture and how it impacts what you are doing, then http://www.agner.org/optimize/ is a good place to learn how much there is to learn.