Why is permute needed in parallel SIMD/SSE/AVX ?

Question

Intel's instruction set manuals will be invaluable to your learning of SIMD. It explains in great detail what each of those instructions is doing.

"Packing" in SSE/AVX is basically a downcast and merge of two registers. PACKSSDW packs 32-bit signed ints from two registers into 16-bit signed ints in one register, and saturates the values (so values < -32768 will be set to -32768, and >32767 will be set to 32767)

A permute is a way of reordering the values in a register. Each value in the mask register specifies an index into the source. This is required because AVX256 "cheated" a little and processes most of its mixing instructions as two 128-bit "lanes".

The 128-bit version of PACKSSDW performs this:

r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)

You'd expect the 256-bit version to maintain the same natural ordering with all the "A"s first and the "B"s second, like this:

r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(a4)
r5 := SignedSaturate(a5)
r6 := SignedSaturate(a6)
r7 := SignedSaturate(a7)
r8 := SignedSaturate(b0)
r9 := SignedSaturate(b1)
r10 := SignedSaturate(b2)
r11 := SignedSaturate(b3)
r12 := SignedSaturate(b4)
r13 := SignedSaturate(b5)
r14 := SignedSaturate(b6)
r15 := SignedSaturate(b7)

But instead, what it actually does this:

r0 := SignedSaturate(a0) // lane one, the low 128 bits.
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)
r8 := SignedSaturate(a4) // lane two, the high 128 bits.
r9 := SignedSaturate(a5)
r10 := SignedSaturate(a6)
r11 := SignedSaturate(a7)
r12 := SignedSaturate(b4)
r13 := SignedSaturate(b5)
r14 := SignedSaturate(b6)
r15 := SignedSaturate(b7)

The result is that when comparing an array of neatly ordered values, the 128-bit version keeps them ordered while the 256-bit version will mix them. The permute puts them back into order.

As I alluded to in my post, you can get rid of the permute in this code by preprocessing your node's array to have the inverse, so that the "mixed" results of the 256-bit op puts it back in order:

void preprocess_avx2(bnode* const node)
{
    __m256i const perm_mask = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);
    __m256i *const middle = (__m256i*)&node->i32[4];

    __m256i x = _mm256_loadu_si256(middle);
    x = _mm256_permutevar8x32_epi32(x, perm_mask);
    _mm256_storeu_si256(middle, x);
}

The ordering is important because of what it does next.

The compare works on 16 32-bit values, but it results in either 0x0000 or 0xFFFF for all of them. You essentially only have 16 bits of information -- off or on for each value. PMOVMSKB treats the input as 32 8-byte values and packs the high bits of each (which is all we need, since all the bits are the same) into a 32-bit int.

TZCNT counts the trailing zero bits in that int, which gives an index to the first position that has a set bit: the index of the first byte in that SIMD register that compared as greater-than.

(Fun fact: TZCNT is a Haswell improvement over the existing BSF instruction, and in fact shares an encoding with it. The only difference is that TZCNT has a defined register output when its input is 0 -- with BSF you'd need to branch.)