Question

I load two 128-bit SSE registers with 16-bit values. The values are in the following order:

src[0] = [E_3, O_3, E_2, O_2, E_1, O_1, E_0, O_0]
src[1] = [E_7, O_7, E_6, O_6, E_5, O_5, E_4, O_4]

What I want to achieve is an order like this:

src[0] = [E_7, E_6, E_5, E_4, E_3, E_2, E_1, E_0]
src[1] = [O_7, O_6, O_5, O_4, O_3, O_2, O_1, O_0]
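In plain C, the split I'm after would look roughly like this (just a scalar sketch for illustration; the function and array names are placeholders, and element 0 of each register is the rightmost entry in the notation above):

#include <stdint.h>

/* scalar reference: "v" holds both vectors back to back in memory order */
static void deinterleave_ref(uint16_t v[16])
{
    uint16_t e[8], o[8];
    for (int i = 0; i < 8; i++) {
        o[i] = v[2 * i];      /* O_i sits at the even word offsets */
        e[i] = v[2 * i + 1];  /* E_i sits at the odd word offsets  */
    }
    for (int i = 0; i < 8; i++) {
        v[i]     = e[i];      /* first vector  -> [E_7, ..., E_0] */
        v[8 + i] = o[i];      /* second vector -> [O_7, ..., O_0] */
    }
}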

Do you know if there is a good way to do this (using SSE intrinsics up to SSE 4.2)?

I'm stuck at the moment because I can't shuffle 16-bit values between the upper and lower halves of a 128-bit register; I have only found the _mm_shufflelo_epi16 and _mm_shufflehi_epi16 intrinsics.

Update:

Thanks to Paul, I thought about using the epi8 intrinsics on the 16-bit values.

My solution is the following:

__m128i xtmp[2], xsrc[2];

/* within each register, gather the words from even positions into one half
   and the words from odd positions into the other half */
__m128i shuffle_split = _mm_set_epi8(15, 14, 11, 10,  7,  6,  3,  2,
                                     13, 12,  9,  8,  5,  4,  1,  0);

xtmp[0] = _mm_load_si128(src_vec);      /* first input vector  */
xtmp[1] = _mm_load_si128(src_vec + 1);  /* second input vector */
xtmp[0] = _mm_shuffle_epi8(xtmp[0], shuffle_split);
xtmp[1] = _mm_shuffle_epi8(xtmp[1], shuffle_split);

/* interleave the grouped halves of both registers, then regroup once more */
xsrc[0] = _mm_unpacklo_epi16(xtmp[0], xtmp[1]);
xsrc[0] = _mm_shuffle_epi8(xsrc[0], shuffle_split);
xsrc[1] = _mm_unpackhi_epi16(xtmp[0], xtmp[1]);
xsrc[1] = _mm_shuffle_epi8(xsrc[1], shuffle_split);

Is there still a better solution?


Solution

Permutations in SSE are not easy. There are many instruction sequences that achieve the same result, and they can differ in instruction count, register pressure, and memory accesses. Rather than struggle with puzzles like this by hand, I prefer to see what the LLVM compiler comes up with, so I wrote your desired permutation in LLVM's intermediate representation, which has an extremely flexible vector shuffle instruction:

define void @shuffle_even_odd(<8 x i16>* %src0) {
  %src1 = getelementptr <8 x i16>* %src0, i64 1
  %a = load <8 x i16>* %src0, align 16
  %b = load <8 x i16>* %src1, align 16
  ; pick the odd-indexed words of the concatenated pair (the E values)
  %x = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  ; pick the even-indexed words (the O values)
  %y = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  store <8 x i16> %x, <8 x i16>* %src0, align 16
  store <8 x i16> %y, <8 x i16>* %src1, align 16
  ret void
}

Compile this with LLVM's IR-to-assembly compiler, llc shuffle_even_odd.ll -o shuffle_even_odd.s, and you get something like the following x86 assembly:

movdqa  (%rdi), %xmm0
movdqa  16(%rdi), %xmm1
movdqa  %xmm1, %xmm2
pshufb  LCPI0_0(%rip), %xmm2
movdqa  %xmm0, %xmm3
pshufb  LCPI0_1(%rip), %xmm3
por %xmm2, %xmm3
movdqa  %xmm3, (%rdi)
pshufb  LCPI0_2(%rip), %xmm1
pshufb  LCPI0_3(%rip), %xmm0
por %xmm1, %xmm0
movdqa  %xmm0, 16(%rdi)

I've excluded the constant data sections referenced by LCPI0_* above, but this roughly translates to the following C code:

#include <tmmintrin.h>  /* SSSE3 intrinsics (_mm_shuffle_epi8) */

void shuffle_even_odd(__m128i * src) {
    /* selector bytes with the high bit set (128 == 0x80) make pshufb write a zero byte */
    __m128i shuffle0 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128, 2, 3, 6, 7, 10, 11, 14, 15);
    __m128i shuffle1 = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15, 128, 128, 128, 128, 128, 128, 128, 128);
    __m128i shuffle2 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128, 0, 1, 4, 5, 8, 9, 12, 13);
    __m128i shuffle3 = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13, 128, 128, 128, 128, 128, 128, 128, 128);
    __m128i a = src[0];
    __m128i b = src[1];
    /* each output ORs the surviving half of a shuffled a with the surviving half of a shuffled b */
    src[0] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle0), _mm_shuffle_epi8(a, shuffle1));
    src[1] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle2), _mm_shuffle_epi8(a, shuffle3));
}

That's only 4 shuffle and 2 bitwise-OR instructions, compared with the 6 shuffles and 2 unpacks in your version. I suspect the bitwise ORs can also be scheduled more efficiently in the CPU pipeline than your unpack instructions.
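For reference, a quick sanity check could look roughly like this (just a throwaway harness; the test values and main scaffolding are illustrative, not part of the routine above; build with SSSE3 enabled, e.g. clang -mssse3):

#include <stdint.h>
#include <stdio.h>
#include <tmmintrin.h>

/* paste shuffle_even_odd() from above here */

int main(void)
{
    /* 16-byte aligned buffer holding O_0, E_0, O_1, E_1, ... in memory order */
    _Alignas(16) uint16_t buf[16];
    for (int i = 0; i < 8; i++) {
        buf[2 * i]     = 0x0100 + i;  /* O_i */
        buf[2 * i + 1] = 0x0200 + i;  /* E_i */
    }

    shuffle_even_odd((__m128i *)buf);

    /* expect E_0..E_7 in the first vector and O_0..O_7 in the second */
    for (int i = 0; i < 8; i++)
        printf("E_%d = %04x   O_%d = %04x\n", i, buf[i], i, buf[8 + i]);
    return 0;
}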

You can find the “llc” compiler in the “Clang Binaries” package from LLVM's download page: http://www.llvm.org/releases/download.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow