Frage

I am working on an implementation of the Differential Evolution optimization algorithm, and want to speed up the calculation time by calculating population members in parallel. I am using the OmniThread library, and have successfully parallelized my loop, only to find that it runs SLOWER than the serial implementation.

I have reduced the code to its essence to test the parallelization, and the reduced version exhibits the same problem: the parallel version is slower than the serial one.

The key is that I pass multiple dynamic arrays to which output should be written for each member of the population. Each array has one of the dimensions dedicated to the population member, so for each population member a different set of array indices is accessed. This also means that in the parallel implementation no 2 threads will write to the same array element.

Below the code I used to test (the actual code in the Differential Evolution has a DoWork procedure with even more const parameters and var arrays)

unit Unit1;

interface

type
  TGoalFunction = reference to function(const X, B: array of extended): extended;
  TArrayExtended1D = array of extended;
  TArrayExtended2D = array of TArrayExtended1D;

  TClassToTest = class abstract
  private
    class procedure DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex, AIndex2: integer);
  public
    class procedure RunSerial;
    class procedure RunParallel;
  end;

function HyperSphere(const X, B: array of extended): extended;

const
  DIMENSION1 = 5000;
  DIMENSION2 = 5000;
  LOOPS = 10;

implementation

uses
  OtlParallel;

function HyperSphere(const X, B: array of extended): extended;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to Length(X) - 1 do
    Result := Result + X[I]*X[I];
end;

{ TClassToTest }

class procedure TClassToTest.DoWork(const AGoalFunction: TGoalFunction; const AInputArray: TArrayExtended2D; var AOutputArray1: TArrayExtended1D; var AOutputArray2: TArrayExtended2D; const AIndex, AIndex2: integer);
var
  I: Integer;
begin
  AOutputArray1[AIndex] := AGoalFunction(AInputArray[AIndex], []);
  for I := 0 to Length(AOutputArray2[AIndex]) - 1 do
    AOutputArray2[AIndex, I] := Random*AIndex2;
end;

class procedure TClassToTest.RunParallel;
var
  LGoalFunction: TGoalFunction;
  LInputArray: TArrayExtended2D;
  LOutputArray1: TArrayExtended1D;
  LOutputArray2: TArrayExtended2D;
  I, J, K: Integer;
begin
  SetLength(LInputArray, DIMENSION1, DIMENSION2);
  for I := 0 to DIMENSION1 - 1 do
  begin
    for J := 0 to DIMENSION2 - 1 do
      LInputArray[I, J] := Random;
  end;
  SetLength(LOutputArray1, DIMENSION1);
  SetLength(LOutputArray2, DIMENSION1, DIMENSION2);

  LGoalFunction := HyperSphere;

  for I := 0 to LOOPS - 1 do
  begin
    Parallel.ForEach(0, DIMENSION1 - 1).Execute(
      procedure (const value: integer)
      begin
        DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, value, I);
      end
    );

    for J := 0 to DIMENSION1 - 1 do
    begin
      for K := 0 to DIMENSION2 - 1 do
        LInputArray[J, K] := LOutputArray2[J, K];
    end;
  end;
end;

class procedure TClassToTest.RunSerial;
var
  LGoalFunction: TGoalFunction;
  LInputArray: TArrayExtended2D;
  LOutputArray1: TArrayExtended1D;
  LOutputArray2: TArrayExtended2D;
  I, J, K: Integer;
begin
  SetLength(LInputArray, DIMENSION1, DIMENSION2);
  for I := 0 to DIMENSION1 - 1 do
  begin
    for J := 0 to DIMENSION2 - 1 do
      LInputArray[I, J] := Random;
  end;
  SetLength(LOutputArray1, DIMENSION1);
  SetLength(LOutputArray2, DIMENSION1, DIMENSION2);

  LGoalFunction := HyperSphere;

  for I := 0 to LOOPS - 1 do
  begin
    for J := 0 to DIMENSION1 - 1 do
    begin
      DoWork(LGoalFunction, LInputArray, LOutputArray1, LOutputArray2, J, I);
    end;

    for J := 0 to DIMENSION1 - 1 do
    begin
      for K := 0 to DIMENSION2 - 1 do
        LInputArray[J, K] := LOutputArray2[J, K];
    end;
  end;
end;

end.

I was expecting a speedup of around x6 on my 8-core processor, but was faced with a slight slowdown. What should I change to get the speedup from running the DoWork procedure in parallel?

Note that I'd prefer to keep the actual work in the DoWork procedure, since I have to be able to call the same algorithm with and without parallelization (boolean flag) while keeping the body of the code shared for easy maintenance

War es hilfreich?

Lösung

This is due to the lack of thread safety of Random. The implementation of which is:

// global var
var
  RandSeed: Longint = 0;    { Base for random number generator }

function Random: Extended;
const
  two2neg32: double = ((1.0/$10000) / $10000);  // 2^-32
var
  Temp: Longint;
  F: Extended;
begin
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F  := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;

Because RandSeed is a global variable, which is modified by a call to Random, the threads end up having contended writes to RandSeed. And those contended writes cause your performance problem. They effectively serialize your parallel code. Severely enough to make it slower than the true serial code.

Add the code below to the top of the implementation section of your unit and you'll see the difference:

threadvar
  RandSeed: Longint;

function Random: Double;
const
  two2neg32: double = ((1.0/$10000) / $10000);  // 2^-32
var
  Temp: Longint;
  F: Double;
begin
  Temp := RandSeed * $08088405 + 1;
  RandSeed := Temp;
  F  := Int64(Cardinal(Temp));
  Result := F * two2neg32;
end;

With that change to avoid shared, contended writes, you'll find that the parallel version is faster, as expected. You don't get linear scaling with processor count. My guess is that is because your pattern of memory access is sub-optimal in the parallel version of the code.

I'm guessing that you are only using Random as a means to generate some data. But if you do need an RNG, you'll want to arrange that each task uses their own private instance of an RNG.

You can also speed up your code a little using Sqr(X) rather than X*X, and also by switching to Double instead of Extended.

Andere Tipps

Some time ago I was experiencing exactly the same issue. It turned out to be that the bottleneck was that OTL for Parallel.ForEach calls with a range creates a hidden enumerator which in cases where the task is very small and the loop is called often is the bottleneck.

A more performant solution looked something like this:

Parallel.ForEach(0, MAXCORES)
    .NumTasks(MAXCORES)
    .Execute(
      procedure (const p:Integer)
      var
        chunkSize : Integer;
        myStart, myEnd : Integer;
        i: Integer;
      begin
        chunkSize := DIMENSION div MAXCORES;
        myStart := p * chunkSize;
        myEnd := min( myStart+chunkSize-1, DIMENSION -1);
        for I := myStart to MyEnd do
          DoSomething(i);
      end);

This code scaled up quite linearly regardless of the load within the DoSomething call

I've tried running this (with the Random fix and using Doubles) on an i7 (8 hyper threads) and get the times 1650ms for parallel and 5240ms for serial. Given the code content I don't find this to be particularly unexpected scale up. The code as it stands will have near to a 100% successful pipeline prediction - all branches predicted, function call returns cached, even cache prefetch working well. On a typical modern PC this means that the code is probably going to be memory bandwidth limited in which scale up is going to depend a great deal on your memory performance rather than how many cores you have.

The only other issue is potential contention for FPU resources which will be highly dependent on your internal processor architecture.

I suspect that if the workload was more complex a greater scale up would be seen between serial and parallel as the serial version will be losing time to code triggered pipeline breaks whilst the parallel version will remain memory limited. I've done a fair bit of high performance computing work in Delphi and well optimised algorithms doing simple calculations can become totally memory bound with multi-threaded performance at scale ups of as little as 2 on a good 8 core machine due to memory bandwidth limits. This sort of issue can be particularly well illustrated if you have over-clocking capability as performance yield from over-clocking the CPU gives a very good indication of the level of memory waits since everything else speeds up proportionally to the over-clocking.

If you want to get into the details of processor architecture and how they impact what you are doing then http://www.agner.org/optimize/ is a good place to learn how much there is to learn.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top