Permutations in SSE are not easy: there are many ways to achieve the same result with various combinations of instructions, and the combinations differ in instruction count, register pressure, and memory accesses. Rather than struggle with puzzles like this by hand, I prefer to see what the LLVM compiler does, so I wrote a simple version of your desired permutation in LLVM's intermediate representation, which has an extremely flexible vector shuffle instruction:
define void @shuffle_even_odd(<8 x i16>* %src0) {
  %src1 = getelementptr <8 x i16>, <8 x i16>* %src0, i64 1
  %a = load <8 x i16>, <8 x i16>* %src0, align 16
  %b = load <8 x i16>, <8 x i16>* %src1, align 16
  ; shuffle indices 0-7 select lanes of %a, indices 8-15 select lanes of %b
  %x = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %y = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  store <8 x i16> %x, <8 x i16>* %src0, align 16
  store <8 x i16> %y, <8 x i16>* %src1, align 16
  ret void
}
Compile this with llc, LLVM's IR-to-assembly compiler:
llc shuffle_even_odd.ll -o shuffle_even_odd.s
and you get something like the following x86 assembly:
movdqa (%rdi), %xmm0
movdqa 16(%rdi), %xmm1
movdqa %xmm1, %xmm2
pshufb LCPI0_0(%rip), %xmm2
movdqa %xmm0, %xmm3
pshufb LCPI0_1(%rip), %xmm3
por %xmm2, %xmm3
movdqa %xmm3, (%rdi)
pshufb LCPI0_2(%rip), %xmm1
pshufb LCPI0_3(%rip), %xmm0
por %xmm1, %xmm0
movdqa %xmm0, 16(%rdi)
I've excluded the constant data sections referenced by the LCPI0_* labels above, but this roughly translates to the following C code:
#include <tmmintrin.h> /* SSSE3, for _mm_shuffle_epi8 */

void shuffle_even_odd(__m128i *src) {
    /* In a pshufb mask, any byte with its high bit set (128 == 0x80)
       zeroes the corresponding output byte. */
    __m128i shuffle0 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128, 2, 3, 6, 7, 10, 11, 14, 15);
    __m128i shuffle1 = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15, 128, 128, 128, 128, 128, 128, 128, 128);
    __m128i shuffle2 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128, 0, 1, 4, 5, 8, 9, 12, 13);
    __m128i shuffle3 = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13, 128, 128, 128, 128, 128, 128, 128, 128);
    __m128i a = src[0];
    __m128i b = src[1];
    src[0] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle0), _mm_shuffle_epi8(a, shuffle1));
    src[1] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle2), _mm_shuffle_epi8(a, shuffle3));
}
That's only four shuffle and two bitwise-OR instructions (plus the loads and stores). I suspect those bitwise instructions can be scheduled more efficiently in the CPU pipeline than your proposed unpack instructions.
You can find the “llc” compiler in the “Clang Binaries” package from LLVM's download page: http://www.llvm.org/releases/download.html