Permutations in SSE are not easy: there are many ways to achieve the same result with various combinations of instructions, and the combinations differ in instruction count, register pressure, and memory accesses. Rather than struggle with puzzles like this by hand, I prefer to see what the LLVM compiler does, so I wrote a simple version of your desired permutation in LLVM's intermediate representation, which has an extremely flexible vector shuffle instruction:
define void @shuffle_even_odd(<8 x i16>* %src0) {
  %src1 = getelementptr <8 x i16>, <8 x i16>* %src0, i64 1
  %a = load <8 x i16>, <8 x i16>* %src0, align 16
  %b = load <8 x i16>, <8 x i16>* %src1, align 16
  ; shuffle indices 0-7 select lanes of %a, indices 8-15 select lanes of %b
  %x = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %y = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  store <8 x i16> %x, <8 x i16>* %src0, align 16
  store <8 x i16> %y, <8 x i16>* %src1, align 16
  ret void
}
Compile this with llc, LLVM's IR-to-assembly compiler:
llc shuffle_even_odd.ll -o shuffle_even_odd.s
and you get something like the following x86 assembly:
movdqa (%rdi), %xmm0
movdqa 16(%rdi), %xmm1
movdqa %xmm1, %xmm2
pshufb LCPI0_0(%rip), %xmm2
movdqa %xmm0, %xmm3
pshufb LCPI0_1(%rip), %xmm3
por %xmm2, %xmm3
movdqa %xmm3, (%rdi)
pshufb LCPI0_2(%rip), %xmm1
pshufb LCPI0_3(%rip), %xmm0
por %xmm1, %xmm0
movdqa %xmm0, 16(%rdi)
I've excluded the constant data sections referenced by the LCPI0_* labels above, but this roughly translates to the following C code:
#include <tmmintrin.h> /* SSSE3, for _mm_shuffle_epi8 */

void shuffle_even_odd(__m128i *src) {
    /* In a pshufb mask, any byte with its high bit set (128 == 0x80)
       zeroes the corresponding output byte. */
    __m128i shuffle0 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128, 2, 3, 6, 7, 10, 11, 14, 15);
    __m128i shuffle1 = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15, 128, 128, 128, 128, 128, 128, 128, 128);
    __m128i shuffle2 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128, 0, 1, 4, 5, 8, 9, 12, 13);
    __m128i shuffle3 = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13, 128, 128, 128, 128, 128, 128, 128, 128);
    __m128i a = src[0];
    __m128i b = src[1];
    src[0] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle0), _mm_shuffle_epi8(a, shuffle1));
    src[1] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle2), _mm_shuffle_epi8(a, shuffle3));
}
That's only four shuffle and two bitwise-OR instructions (plus the loads and stores). I suspect those bitwise instructions can be scheduled more efficiently in the CPU pipeline than your proposed unpack instructions.
You can find the “llc” compiler in the “Clang Binaries” package from LLVM's download page: http://www.llvm.org/releases/download.html