How to use align-data-move SSE in Delphi XE3?

Question 1

Your data needs to be 16-byte aligned, and that takes some care. You can make sure that the heap allocator aligns to 16 bytes. But you cannot make the compiler align your stack-allocated variables to 16 bytes, because your array has an alignment of 4, the size of its elements. Any variables declared inside other structures will also have 4-byte alignment. That is a tough hurdle to clear.

I don't think you can solve your problem in the currently available versions of the compiler, at least not unless you forgo stack-allocated variables, which I'd guess is too bitter a pill to swallow. You might have some luck with an external assembler.
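Because the compiler gives no 16-byte guarantee for stack variables, it can help to test an address before handing it to an aligned-move instruction. A minimal sketch; `IsAligned16` is a hypothetical helper name, not an RTL function:

```pascal
program AlignCheckDemo;
{$APPTYPE CONSOLE}

type
  TVector = array [1..4] of Single;

// Hypothetical helper: True when the address P is a multiple of 16.
function IsAligned16(P: Pointer): Boolean;
begin
  Result := (NativeUInt(P) and 15) = 0;
end;

procedure Demo;
var
  V: TVector; // stack allocated; only 4-byte alignment is guaranteed
begin
  if IsAligned16(@V) then
    Writeln('V happens to be 16-byte aligned, movaps is safe here')
  else
    Writeln('V is not 16-byte aligned, movaps would fault (movups would not)');
end;

begin
  Demo;
end.
```

Which branch runs depends on where the compiler happens to place `V`, so the check says nothing about the next build or the next call site.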

Question 2

You can write your own memory allocation routines that allocate aligned data in the heap. You can specify your own alignment size (not just 16 bytes but also 32 bytes, 64 bytes and so on...):

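    // Returns a pointer aligned to (1 shl bits) bytes, e.g. bits = 4 gives 16 bytes.
    // If src is nil, a zeroed block of SrcSize bytes is allocated. Otherwise the
    // contents of src are copied into a new block, unless src is already aligned.
    // DstUnaligned and DstSize must later be passed to FreeMemAligned.
    procedure GetMemAligned(const bits: Integer; const src: Pointer;
      const SrcSize: Integer; out DstAligned, DstUnaligned: Pointer;
      out DstSize: Integer);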
    var
      Bytes: NativeInt;
      i: NativeInt;
    begin
      if src <> nil then
      begin
        i := NativeInt(src);
        i := i shr bits;
        i := i shl bits;
        if i = NativeInt(src) then
        begin
          // the source is already aligned, nothing to do
          DstAligned := src;
          DstUnaligned := src;
          DstSize := SrcSize;
          Exit;
        end;
      end;
      Bytes := 1 shl bits;
      DstSize := SrcSize + Bytes;
      GetMem(DstUnaligned, DstSize);
      FillChar(DstUnaligned^, DstSize, 0);
      i := NativeInt(DstUnaligned) + Bytes;
      i := i shr bits;
      i := i shl bits;
      DstAligned := Pointer(i);
      if src <> nil then
        Move(src^, DstAligned^, SrcSize);
    end;

    procedure FreeMemAligned(const src: Pointer; var DstUnaligned: Pointer;
      var DstSize: Integer);
    begin
      if src <> DstUnaligned then
      begin
        if DstUnaligned <> nil then
          FreeMem(DstUnaligned, DstSize);
      end;
      DstUnaligned := nil;
      DstSize := 0;
    end;

Then work with pointers, and write your routines as procedures that return the result through a third (out) argument.

You can also use functions, but then it is less obvious where the result is returned.

    type
      PVector = ^TVector;
      TVector = packed array [1..4] of Single;

Then allocate these objects this way:

    const
      SizeAligned = SizeOf(TVector);
    var
      DataUnaligned, DataAligned: Pointer;
      SizeUnaligned: Integer;
      V1: PVector;
    begin
      GetMemAligned(4 {align by 4 bits, i.e. by 16 bytes}, nil, SizeAligned,
        DataAligned, DataUnaligned, SizeUnaligned);
      V1 := DataAligned;
      // now you can work with your vector via V1^ - it is aligned by 16 bytes and stays in the heap

      FreeMemAligned(nil, DataUnaligned, SizeUnaligned);
    end;

As you may have noticed, we passed nil to GetMemAligned and FreeMemAligned. This parameter is needed only when we want to align existing data, e.g. data that we have received as a function argument.
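For the non-nil case, here is a sketch of how the two routines might wrap code that needs aligned input. It is a fragment, not a complete program: it assumes `GetMemAligned`, `FreeMemAligned`, `TVector` and `PVector` are declared as shown above, and `WithAlignedCopy` is a hypothetical name.

```pascal
procedure WithAlignedCopy(const Src: TVector);
var
  Aligned, Unaligned: Pointer;
  Size: Integer;
  V: PVector;
begin
  // Pass the address of the existing data instead of nil. If Src is already
  // 16-byte aligned, nothing is allocated and Aligned simply equals @Src.
  GetMemAligned(4, @Src, SizeOf(TVector), Aligned, Unaligned, Size);
  try
    V := Aligned;
    // V^ now holds the contents of Src at a 16-byte aligned address,
    // so it can be handed to a routine that uses movaps.
  finally
    // Pass the same source pointer again, so that nothing is freed
    // when no copy was made.
    FreeMemAligned(@Src, Unaligned, Size);
  end;
end;
```

Note that when a copy is made, anything written through `V^` lands in the copy and is not written back to `Src`.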

Use the register names directly rather than the parameter names in assembly routines. With the register calling convention the parameter names are just aliases for registers, so if you rely on the names you risk overwriting a register without realizing that it still holds a parameter.

Under Win64, with the Microsoft calling convention, the first integer or pointer parameter is passed in RCX, the second in RDX, the third in R8, the fourth in R9, and the rest on the stack. A function returns its result in RAX. But if a function returns a structure ("record") that does not fit in a register, the result is not returned in RAX but through an implicit argument, by address. Your function may modify the following registers: RAX, RCX, RDX, R8, R9, R10, R11. The rest must be preserved. See https://msdn.microsoft.com/en-us/library/ms235286.aspx for more details.

Under Win32, with the Delphi register calling convention, a call passes the first parameter in EAX, the second in EDX, the third in ECX, and the rest on the stack.

The following table summarizes the differences:

    Parameter   Win64   Win32
    ---------   -----   -----
    1           rcx     eax
    2           rdx     edx
    3           r8      ecx
    4           r9      stack

So, your function will look like this (32-bit):

    procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
    asm
      movaps xmm0, [eax]
      movaps xmm1, [edx]
      addps xmm0, xmm1
      movaps [ecx], xmm0
    end;

Under 64-bit:

    procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
    asm
      movaps xmm0, [rcx]
      movaps xmm1, [rdx]
      addps xmm0, xmm1
      movaps [r8], xmm0
    end;
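If both targets are built from one source file, the two bodies above can be merged with conditional compilation. A sketch, assuming the predefined `CPUX64` symbol; the program name, the variables `V1` to `V3`, and the use of `SetMinimumBlockAlignment` to obtain aligned heap blocks are choices made for this illustration:

```pascal
program Add4Demo;
{$APPTYPE CONSOLE}

type
  PVector = ^TVector;
  TVector = packed array [1..4] of Single;

// All three parameters arrive as addresses and must be 16-byte aligned.
procedure add4(const a, b: TVector; out Result: TVector);
asm
{$IFDEF CPUX64}
  // Win64: a in RCX, b in RDX, Result in R8
  movaps xmm0, [rcx]
  movaps xmm1, [rdx]
  addps  xmm0, xmm1
  movaps [r8], xmm0
{$ELSE}
  // Win32 register convention: a in EAX, b in EDX, Result in ECX
  movaps xmm0, [eax]
  movaps xmm1, [edx]
  addps  xmm0, xmm1
  movaps [ecx], xmm0
{$ENDIF}
end;

var
  V1, V2, V3: PVector;
  I: Integer;
begin
  // Ask the memory manager for 16-byte aligned blocks from here on.
  SetMinimumBlockAlignment(mba16Byte);
  New(V1);
  New(V2);
  New(V3);
  try
    // Initialize every lane so that no garbage values are added.
    for I := 1 to 4 do
    begin
      V1^[I] := I;
      V2^[I] := 10 * I;
    end;
    add4(V1^, V2^, V3^);
    for I := 1 to 4 do
      Assert(V3^[I] = 11 * I);
  finally
    Dispose(V3);
    Dispose(V2);
    Dispose(V1);
  end;
end.
```

The vectors live on the heap here because stack variables carry no 16-byte guarantee.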

By the way, according to Microsoft, scalar floating-point arguments in the 64-bit calling convention are passed directly in the XMM registers: the first in XMM0, the second in XMM1, the third in XMM2, the fourth in XMM3, and the rest on the stack. So you can pass them by value, not by reference.
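For scalar values this makes the assembly very short. A sketch for the x64 target only; `AddSingles` is a hypothetical name, and the comments describe the Win64 convention as summarized above. A 16-byte array such as `TVector` is still passed by address, so this applies to `Single` and `Double` parameters rather than to whole vectors.

```pascal
{$IFDEF CPUX64}
// a arrives in XMM0 and b in XMM1; a Single result is returned in XMM0.
function AddSingles(a, b: Single): Single;
asm
  addss xmm0, xmm1
end;
{$ENDIF}
```

No memory operands are involved, so alignment does not matter for this routine.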

Question 3

Use this to make the built-in memory manager allocate with 16-byte alignment:

    SetMinimumBlockAlignment(mba16Byte);
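A small sketch of that call in context. The program name and the assertion are illustrative additions; the setting is made before the allocation so that it is in effect when the block is requested.

```pascal
program AlignedHeapDemo;
{$APPTYPE CONSOLE}

type
  PVector = ^TVector;
  TVector = array [1..4] of Single;

var
  V: PVector;
begin
  // Request 16-byte alignment from the built-in memory manager.
  SetMinimumBlockAlignment(mba16Byte);
  GetMem(V, SizeOf(TVector));
  try
    // The low four address bits should now be zero.
    Assert((NativeUInt(V) and 15) = 0, 'block is not 16-byte aligned');
    V^[1] := 1;
  finally
    FreeMem(V);
  end;
end.
```

This only covers heap blocks; stack variables and fields inside records are unaffected.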

Also, as far as I know, both "register" and "assembler" are redundant directives (register is the default calling convention), so you can omit them from your code.

--

Edit: you mention this is for x64. I just tried the following in Delphi XE2 compiled for x64 and it works here.

    program Project3;

    type
      Vector = array [1..4] of Single;

    function add4(const a, b: Vector): Vector;
    asm
      movaps xmm0, [a]
      movaps xmm1, [b]
      addps xmm0, xmm1
      movaps [@result], xmm0
    end;

    procedure f();
    var
      v1, v2: Vector;
    begin
      v1[1] := 1;
      v2[1] := 1;
      v1 := add4(v1, v2);
    end;

    begin
      {$ifndef cpux64}
      {$MESSAGE FATAL 'this example is for x64 target only'}
      {$else}
      f();
      {$endif}
    end.