Question

I was trying to run the following,

type
  Vector = array [1..4] of Single;

{$CODEALIGN 16}
function add4(const a, b: Vector): Vector; register; assembler;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;

It raises an access violation on movaps. As far as I know, movaps is only safe if the memory location is 16-byte aligned. It works without problems with movups (no alignment needed).

So my question is: why does {$CODEALIGN} not seem to work in this case in Delphi XE3?

EDIT

Very strange... I tried the following.

program Project3;

{$APPTYPE CONSOLE}

uses
  windows;  // if not using windows, no errors at all

type
  Vector = array [1..4] of Single;

function add4(const a, b: Vector): Vector;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;

procedure test();
var
  v1, v2: vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1,v2);  // this works
end;

var
  a, b, c: Vector;

begin
  {$ifndef cpux64}
    {$MESSAGE FATAL 'this example is for x64 target only'}
  {$else}
  test();
  c := add4(a, b); // throw out AV here
  {$endif}
end.

If 'uses windows' is not added, everything is fine. With 'uses windows', it throws an exception at c := add4(a, b) but not in test().

Who can explain this?

EDIT: It all makes sense to me now. The conclusions for Delphi XE3, 64-bit, are:

  1. Stack frames on x64 are aligned to 16 bytes (as required); {$CODEALIGN 16} aligns the code of procedures/functions to 16 bytes, not their data.
  2. A dynamic array lives on the heap, which can be set to 16-byte alignment using SetMinimumBlockAlignment(mba16Byte).
  3. However, stack variables are not always 16-byte aligned. For example, if you declare an Integer variable before v1, v2 in test() above, the example will not work.

Solution

You need your data to be 16-byte aligned. That requires some care and attention. You can make sure that the heap allocator aligns to 16 bytes. But you cannot make the compiler 16-byte align your stack-allocated variables, because your array has an alignment of 4, the size of its elements. Any variables declared inside other structures will also have 4-byte alignment, which is a tough hurdle to clear.

I don't think you can solve your problem in the currently available versions of the compiler. At least not unless you forgo stack allocated variables which I'd guess to be too bitter a pill to swallow. You might have some luck with an external assembler.
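To see which variables actually end up 16-byte aligned in a given build, a small diagnostic along the following lines may help. It is only a sketch: IsAligned16 is a hypothetical helper (not an RTL function), and the results depend on the compiler version, target and surrounding declarations, so no particular output is promised.

```pascal
program CheckAlign;

{$APPTYPE CONSOLE}

type
  Vector = array [1..4] of Single;

// Hypothetical helper: True if V starts on a 16-byte boundary.
// A const array of this size is passed by reference, so @V is the
// address of the caller's variable.
function IsAligned16(const V: Vector): Boolean;
begin
  Result := (NativeUInt(@V) and 15) = 0;
end;

var
  G1, G2: Vector;  // global variables

procedure Test;
var
  Pad: Integer;    // extra local that may shift the vectors on the stack
  V1, V2: Vector;  // stack variables
begin
  Pad := 1;
  V1[1] := Pad;
  V2[1] := Pad;
  Writeln('local V1 aligned:  ', IsAligned16(V1));
  Writeln('local V2 aligned:  ', IsAligned16(V2));
end;

begin
  G1[1] := 1;
  G2[1] := 1;
  Test;
  Writeln('global G1 aligned: ', IsAligned16(G1));
  Writeln('global G2 aligned: ', IsAligned16(G2));
end.
```

Any line that prints FALSE marks a variable that is not safe to use with movaps.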

OTHER TIPS

You can write your own memory allocation routines that allocate aligned data in the heap. You can specify your own alignment size (not just 16 bytes but also 32 bytes, 64 bytes and so on...):

    procedure GetMemAligned(const bits: Integer; const src: Pointer;
      const SrcSize: Integer; out DstAligned, DstUnaligned: Pointer;
      out DstSize: Integer);
    var
      Bytes: NativeInt;
      i: NativeInt;
    begin
      if src <> nil then
      begin
        i := NativeInt(src);
        i := i shr bits;
        i := i shl bits;
        if i = NativeInt(src) then
        begin
          // the source is already aligned, nothing to do
          DstAligned := src;
          DstUnaligned := src;
          DstSize := SrcSize;
          Exit;
        end;
      end;
      Bytes := 1 shl bits;
      DstSize := SrcSize + Bytes;
      GetMem(DstUnaligned, DstSize);
      FillChar(DstUnaligned^, DstSize, 0);
      i := NativeInt(DstUnaligned) + Bytes;
      i := i shr bits;
      i := i shl bits;
      DstAligned := Pointer(i);
      if src <> nil then
        Move(src^, DstAligned^, SrcSize);
    end;

    procedure FreeMemAligned(const src: Pointer; var DstUnaligned: Pointer;
      var DstSize: Integer);
    begin
      if src <> DstUnaligned then
      begin
        if DstUnaligned <> nil then
          FreeMem(DstUnaligned, DstSize);
      end;
      DstUnaligned := nil;
      DstSize := 0;
    end;

Then use pointers, and procedures that take a third argument to return the result.

You can also use functions, but how their result is passed is less obvious.

type
  PVector = ^TVector;
  TVector = packed array [1..4] of Single;

Then allocate these objects this way:

const
   SizeAligned = SizeOf(TVector);
var
   DataUnaligned, DataAligned: Pointer;
   SizeUnaligned: Integer;
   V1: PVector;
begin
  GetMemAligned(4 {align by 4 bits, i.e. by 16 bytes}, nil, SizeAligned, DataAligned, DataUnaligned, SizeUnaligned);
  V1 := DataAligned;
  // now you can work with your vector via V1^ - it is aligned by 16 bytes and stays in the heap

  FreeMemAligned(nil, DataUnaligned, SizeUnaligned);
end;

As you may have noticed, we passed nil to GetMemAligned and FreeMemAligned. This parameter is needed when we want to align existing data, for example data received as a function argument.
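The rounding step in GetMemAligned (shift right by bits, then left again) clears the low bits of an address. The same arithmetic can be written with a mask, as in the sketch below. AlignUp is a hypothetical helper used for illustration only, and it assumes Alignment is a power of two. Unlike GetMemAligned, it leaves an already aligned address unchanged instead of moving to the next boundary.

```pascal
program AlignUpDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: rounds Addr up to the next multiple of Alignment.
// Assumes Alignment is a power of two (16, 32, 64, ...).
function AlignUp(Addr, Alignment: NativeUInt): NativeUInt;
begin
  Result := (Addr + Alignment - 1) and not (Alignment - 1);
end;

begin
  Writeln(AlignUp(17, 16));  // prints 32
  Writeln(AlignUp(32, 16));  // prints 32 (already aligned)
end.
```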

Just use register names directly rather than parameter names in assembly routines. You will not break anything by doing so when using the register calling convention. Otherwise you risk modifying registers without realizing that the parameter names are just aliases for those registers.

Under Win64, with the Microsoft calling convention, integer and pointer parameters are passed as follows: the first in RCX, the second in RDX, the third in R8, the fourth in R9, and the rest on the stack. A function returns its result in RAX. But if a function returns a structure ("record") result, it is not returned in RAX but through an implicit argument, by address. Your function may modify the following registers: RAX, RCX, RDX, R8, R9, R10, R11. The rest should be preserved. See https://msdn.microsoft.com/en-us/library/ms235286.aspx for more details.

Under Win32, with the Delphi register calling convention, the first parameter is passed in EAX, the second in EDX, the third in ECX, and the rest on the stack.

The following table summarizes the differences:

         64     32
         ---   ---
    1)   rcx   eax
    2)   rdx   edx
    3)   r8    ecx
    4)   r9    stack

So, your function will look like this (32-bit):

procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
asm
  movaps xmm0, [eax]
  movaps xmm1, [edx]
  addps xmm0, xmm1
  movaps [ecx], xmm0
end;

Under 64-bit:

procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
asm
  movaps xmm0, [rcx]
  movaps xmm1, [rdx]
  addps xmm0, xmm1
  movaps [r8], xmm0
end;

By the way, according to Microsoft, floating-point arguments in the 64-bit calling convention are passed directly in the XMM registers: the first in XMM0, the second in XMM1, the third in XMM2, the fourth in XMM3, and the rest on the stack. So you can pass them by value, not by reference.
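As an illustration of passing floating-point values by value, here is a minimal sketch for the Win64 target only. AddSingles is a hypothetical name; the sketch assumes the compiler emits no prologue that disturbs XMM0 or XMM1 for a frameless asm function like this one.

```pascal
program ScalarXmm;

{$APPTYPE CONSOLE}

{$IFNDEF CPUX64}
  {$MESSAGE FATAL 'this sketch is for the x64 target only'}
{$ENDIF}

// Win64: a arrives in XMM0 and b in XMM1; a Single result is
// returned in XMM0, so a single scalar add is all that is needed.
function AddSingles(a, b: Single): Single;
asm
  addss xmm0, xmm1
end;

begin
  Writeln(AddSingles(1.5, 2.25):0:2);  // sum of 1.5 and 2.25
end.
```

Since scalar arguments never touch memory here, no alignment requirement applies. This only helps for individual Single or Double values, not for the four-element vector from the question.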

Use this to make the built-in memory manager allocate with 16-byte alignment:

SetMinimumBlockAlignment(mba16Byte);
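A minimal sketch of how this could be used and checked, assuming the default built-in memory manager is in use (a replacement memory manager may behave differently). The type names are taken from the examples above:

```pascal
program HeapAlignDemo;

{$APPTYPE CONSOLE}

type
  TVector = array [1..4] of Single;
  PVector = ^TVector;

var
  P: PVector;
begin
  // Ask the built-in memory manager for 16-byte aligned blocks.
  // Call this early, before the blocks you care about are allocated.
  SetMinimumBlockAlignment(mba16Byte);

  GetMem(P, SizeOf(TVector));
  try
    // Low four address bits are zero when P is 16-byte aligned.
    Writeln('16-byte aligned: ', (NativeUInt(P) and 15) = 0);
  finally
    FreeMem(P);
  end;
end.
```

Note that this covers heap blocks only. It does not change how stack or global variables are laid out.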

Also, as far as I know, both "register" and "assembler" are redundant directives, so you can omit them from your code.

--

Edit: you mention this is for x64. I just tried the following in Delphi XE2 compiled for x64 and it works here.

program Project3;

type
  Vector = array [1..4] of Single;

function add4(const a, b: Vector): Vector;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;

procedure f();
var
  v1,v2 : vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1,v2);
end;

begin
  {$ifndef cpux64}
  {$MESSAGE FATAL 'this example is for x64 target only'}
  {$else}
  f();
  {$endif}
end.
Licensed under: CC-BY-SA with attribution