FLD instruction x64 bit

Question 1

In x64 mode, floating-point parameters are passed in XMM registers. So when Delphi compiles FLD X, it becomes FLD xmm0, and no such instruction exists: FLD only accepts a memory operand or an x87 stack register. You first need to move the value to memory.

The same goes for the result: it must be passed back in xmm0.

Try this (not tested):

function DoSomething(X:Double):Double;
var
  Temp : double;
asm
  MOVQ qword ptr Temp,X
  FLD Temp
  //do something
  FSTP Temp   // store and pop, so the x87 stack is empty on return
  MOVQ xmm0,qword ptr Temp
end;

Question 2

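Delphi follows the Microsoft x64 calling convention. So if the arguments of a function/procedure are float/double, they are passed in XMM0L, XMM1L, XMM2L, and XMM3L (the low parts of XMM0 to XMM3).

But as a workaround you can put var before the parameter, so that it is passed by reference (as a pointer in a general-purpose register) and can be loaded straight from memory: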

function DoSomething(var X:Double):Double;
asm
  FLD  qword ptr [X]
  // Do Something ..
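  FSTP qword ptr [X]         // store and pop; note this overwrites the caller's variable
  MOVSD xmm0, qword ptr [X]  // a Double result must be returned in XMM0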
end;

Question 3

You don't need to use legacy x87 stack registers in x86-64 code, because SSE2 is baseline, a required part of the x86-64 ISA. You can and should do your scalar FP math using addsd, mulsd, sqrtsd and so on, on XMM registers. (Or addss for float)

The Windows x64 calling convention passes float/double FP args in XMM0..3, if they're one of the first four args to the function. (i.e. the 3rd total arg goes in xmm2 if it's FP, rather than the 3rd FP arg going in xmm2.) It returns FP values in XMM0.
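To make the slot-based register assignment concrete, here is a minimal, untested sketch. The function name AddSecondAndThird and its parameters are made up for illustration; it assumes the Win64 target and a frameless asm routine with no locals.

```pascal
// Hypothetical example: argument registers are chosen by position, not by type count.
//   A (1st arg, integer) -> ECX
//   B (2nd arg, Double)  -> XMM1   (not XMM0, even though it is the first FP arg)
//   C (3rd arg, Double)  -> XMM2
function AddSecondAndThird(A: Integer; B: Double; C: Double): Double;
asm
    movapd xmm0, xmm1       // copy B into the return register
    addsd  xmm0, xmm2       // XMM0 := B + C, returned in XMM0
end;
```

Under those assumptions, AddSecondAndThird(7, 1.5, 2.25) should return 3.75; the integer argument in ECX is simply ignored.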

Only use x87 if you actually need 80-bit precision inside your function. (Instructions like fsin and fyl2x are not fast, and normal math libraries can usually do the same job just as well using SSE/SSE2 instructions.)

function times2(X:Double):Double;
asm
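    addsd  xmm0, xmm0       // X arrives in XMM0 and the result is returned there; upper 8 bytes of XMM0 are ignored
end;                        // no explicit ret needed: Delphi emits the return itself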

Storing to memory and reloading into an x87 register costs you about 10 cycles of latency for no benefit. SSE/SSE2 scalar instructions are as fast as or faster than their x87 equivalents, and easier to program for and optimize because you never need fxch: the register file is flat instead of stack-based (see https://agner.org/optimize/). Also, you have 16 XMM registers (xmm0..xmm15) in 64-bit mode.

Of course, you usually don't need inline asm at all. It could be useful for manually-vectorizing if the compiler doesn't do that for you.
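For comparison, a plain Pascal version of the doubling function might look like the sketch below. The name Times2Plain is hypothetical. On the Win64 target the Delphi compiler generates SSE2 scalar code for Double arithmetic by itself, so no asm block should be needed for something this simple.

```pascal
// Hypothetical pure-Pascal equivalent of the asm version; no inline asm required.
function Times2Plain(X: Double): Double;
begin
  Result := X + X;   // the compiler picks the registers and instructions
end;
```

This form also compiles unchanged for 32-bit targets, where the asm versions above would not apply because the calling convention differs.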