Question

I am writing a code library in x86-64 assembly language to provide all conventional bitwise, shift, logical, compare, arithmetic, and math functions for s0128, s0256, s0512, s1024, s2048, and s4096 signed-integer types and f0128, f0256, f0512, f1024, f2048, and f4096 floating-point types.

Now I am writing some type conversion routines, and have run into something that should be trivial but takes a lot more instructions than I would expect. I feel like I must be missing something (some instructions) to make this easier, but so far no luck.

The low 128 bits of the s0256 result are simply a copy of the s0128 input argument, and all the bits in the upper 128 bits of the s0256 result must be set to the most significant bit of the s0128 input argument.

Simple, huh? But here is the best I can figure so far to convert s0128 to s0256. Ignore the first 4 lines (they're just argument error checks) and the last 2 lines (returning from the function with no error (rax == 0)). The 5 lines in the middle are the algorithm in question. I'm trying to avoid [conditional] jump instructions.

.text
.align 64
big_m63:
.quad  -63, -63                       # two shift counts for vpshaq instruction

big_s0256_eq_s0128:    # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
  orq        %rdi, %rdi               # is arg0 a valid address ???
  jz         error_argument_invalid   # nope
  orq        %rsi, %rsi               # is arg1 a valid address ???
  jz         error_argument_invalid   # nope

  vmovapd    (%rsi), %xmm0            # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
  vmovhlps   %xmm0, %xmm0, %xmm1      # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
  vpshaq     big_m63, %xmm1, %xmm1    # ymm1 = arg1.sign : arg1.sign : 0 : 0
  vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
  vmovapd    %ymm0, (%rdi)            # arg0 = arg1 (sign-extended to 256-bits)

  xorq       %rax, %rax               # rax = 0 == no error
  ret                                 # return from function

This routine is also non-optimal in that every instruction requires the result of the previous instruction, which prevents any parallel execution.

Is there a better instruction to right-shift with sign extension? I cannot find an instruction like vpshaq that accepts an immediate byte to specify the shift count, though I don't know why (many SIMD instructions take 8-bit immediate operands for various purposes). Also, vpshaq is an AMD XOP instruction, so Intel CPUs do not support it. Oops!

But look! StephenCanon has a brilliant solution to this problem below! Awesome! That solution has one more instruction than the above, but the vpxor instruction can be scheduled right after the first vmovapd instruction and should effectively take no more cycles than the 5-instruction version above. Bravo!

For completeness and easy comparison, here is the code with the latest StephenCanon enhancement:

.text
.align 64
big_s0256_eq_s0128:    # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
  orq        %rdi, %rdi               # is arg0 a valid address ???
  jz         error_argument_invalid   # nope
  orq        %rsi, %rsi               # is arg1 a valid address ???
  jz         error_argument_invalid   # nope

  vmovapd    (%rsi), %xmm0            # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
  vpxor      %xmm2, %xmm2, %xmm2      # ymm2 = 0 : 0 : 0 : 0
  vmovhlps   %xmm0, %xmm0, %xmm1      # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
  vpcmpgtq   %xmm1, %xmm2, %xmm1      # ymm1 = arg1.sign : arg1.sign : 0 : 0
  vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
  vmovapd    %ymm0, (%rdi)            # arg0 = arg1 (sign-extended to 256-bits)

  xorq       %rax, %rax               # rax = 0 == no error
  ret                                 # return from function

I'm not certain, but not needing to read those two 64-bit shift counts from memory might also speed the code up slightly. Nice.

Solution

You're over-complicating things. Once you have the sign in rax, just do two 64b stores from there instead of trying to assemble the result in ymm0. One less instruction and a much shorter dependency chain.
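
A rough sketch of that idea (untested; register choices are illustrative, offsets assume the little-endian qword layout from the question, and the argument checks and return sequence stay as they are):

movq     8(%rsi), %rax    # rax = arg1.ms64 (high qword of the s0128)
movq      (%rsi), %rcx    # rcx = arg1.ls64
movq     %rcx,   (%rdi)   # copy the s0128 into the low 128 bits of the result
movq     %rax,  8(%rdi)
sarq     $63, %rax        # rax = the sign of arg1 replicated across all 64 bits
movq     %rax, 16(%rdi)   # two 64-bit stores of the sign fill the high 128 bits
movq     %rax, 24(%rdi)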

As the destination type gets larger, of course, it makes sense to use wider stores (AVX). With AVX2 you can use vbroadcastq to do the splat more efficiently, but it looks like you're targeting baseline AVX?
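
For reference, a rough sketch of that broadcast approach (the AVX2 integer broadcast is spelled vpbroadcastq; this is an untested sketch that assumes the same calling convention and aligned buffers as the question):

vpbroadcastq 8(%rsi), %ymm1            # arg1.ms64 splatted into all four qword lanes
vpxor        %ymm2, %ymm2, %ymm2       # all zeros
vpcmpgtq     %ymm1, %ymm2, %ymm1       # (0 > ms64) ? all-ones : all-zeros, per lane
vinserti128  $0, (%rsi), %ymm1, %ymm0  # low 128 bits = arg1, high 128 bits = sign
vmovdqa      %ymm0, (%rdi)             # store the sign-extended s0256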

I should also note that once you get to ~512b integers, for most algorithms the cost of super-linear operations like multiplication so completely dominates the running time that squeezing every last cycle out of operations like sign extension rapidly starts to lose value. It's a good exercise, but ultimately not the most productive use of your time once your implementations are "good enough".


After further thought, I have the following suggestion:

vmovhlps  %xmm0, %xmm0, %xmm1 # could use a permute instead to stay in the integer domain
vpxor     %xmm2, %xmm2, %xmm2
vpcmpgtq  %xmm1, %xmm2, %xmm2 # generate the sign extension without a shift

This has the virtues of (a) not requiring a constant load and (b) working on both Intel and AMD. The xor to generate zero looks like an extra instruction, but in practice this zeroing idiom doesn’t even require an execute slot on recent processors.


FWIW, if targeting AVX2, I might write it like this:

vmovdqa  (%rsi),       %xmm0 # { x0, x1, 0,  0  }
vpermq   $0x5f, %ymm0, %ymm1 # { 0,  0,  x1, x1 }
vpxor    %ymm2, %ymm2, %ymm2 # { 0,  0,  0,  0  }
vpcmpgtq %ymm1, %ymm2, %ymm2 # { 0,  0,  s,  s  }  s = sign extension
vpor     %ymm2, %ymm0, %ymm0 # { x0, x1, s,  s  }
vmovdqa  %ymm0,       (%rdi)

Unfortunately, I don’t think that vpermq is available on AMD.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow