I am writing a code library in x86-64 assembly-language to provide all conventional bitwise, shift, logical, compare, arithmetic and math functions for s0128
, s0256
, s0512
, s1024
, s2048
, and s4096
signed-integer types and f0128
, f0256
, f0512
, f1024
, f2048
, and f4096
floating-point types.
Now I am writing some type conversion routines, and have run into something that should be trivial but takes a lot more instructions than I would expect. I feel like I must be missing something (some instructions) to make this easier, but so far no luck.
The low 128-bits of the s0256
result is simply a copy of the s0128
input argument, and all the bits in the upper 128-bits of the s0256
result must be set to the most-significant bit in the s0128
input argument.
Simple, huh? But here is the best I can figure so far to convert s0256
to s0128
. Ignore the first 4 lines (they're just argument error checks) and last 2 lines (returning from function with no error (rax == 0)). The 5 lines in the middle are the algorithm in question. Try to avoid [conditional] jump instructions.
.text
.align 64
big_m63:
.quad -63, -63 # two shift counts for vpshaq instruction
big_s0256_eq_s0128: # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
orq %rdi, %rdi # is arg0 a valid address ???
jz error_argument_invalid # nope
orq %rsi, %rsi # is arg1 a valid address ???
jz error_argument_invalid # nope
vmovapd (%rsi), %xmm0 # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
vmovhlps %xmm0, %xmm0, %xmm1 # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
vpshaq big_m63, %xmm1, %xmm1 # ymm1 = arg1.sign : arg1.sign : 0 : 0
vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm1 = arg1.ls64 : arg1.ms64 : sign : sign
vmovapd %ymm0, (%rdi) # arg0 = arg1 (sign-extended to 256-bits)
xorq %rax, %rax # rax = 0 == no error
ret # return from function
This routine is also non-optimal in that every instruction requires the result of the previous instruction, which prevents parallel execution of any instructions.
Is there a better instruction to right-shift with sign extension? I cannot find an instruction like vpshaq
that accepts an immediate byte to specify shift-count, though I don't know why (many SIMD instructions have immediate 8-bit operands for various purposes). Also, Intel does not support vpshaq
. Oops!
But look! StephenCanon has a brilliant solution to this problem below! Awesome! That solution has one more instruction than the above, but the vpxor
instruction can be put after the first vmovapd
instruction and should effectively take no more cycles than the 5 instruction version above. Bravo!
For completeness and easy comparison, here is the code with the latest StephenCanon enhancement:
.text
.align 64
big_s0256_eq_s0128: # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
orq %rdi, %rdi # is arg0 a valid address ???
jz error_argument_invalid # nope
orq %rsi, %rsi # is arg1 a valid address ???
jz error_argument_invalid # nope
vmovapd (%rsi), %xmm0 # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
vpxor %xmm2, %xmm2, %xmm2 # ymm2 = 0 : 0 : 0 : 0
vmovhlps %xmm0, %xmm0, %xmm1 # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
vpcmpgtq %xmm1, %xmm2, %xmm1 # ymm1 = arg1.sign : arg1.sign : 0 : 0
vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm1 = arg1.ls64 : arg1.ms64 : sign : sign
vmovapd %ymm0, (%rdi) # arg0 = arg1 (sign-extended to 256-bits)
xorq %rax, %rax # rax = 0 == no error
ret # return from function
I'm not certain, but not needing to read those two 64-bit shift-counts from memory might also speed the code up slightly. Nice.