Question

What is the quickest way to perform a rotate operation on the entirety of a YMM register, by an amount known only at runtime?

The rotation is known to be by a multiple of 64 bits.

Était-ce utile?

La solution

With AVX2 you can use _mm256_permutevar8x32_epi32. Pseudo-code (not tested, constants are likely wrong):

static inline __m256i rotate(__m256i x, unsigned n) {
    static const __m256i rotspec[4] = {
        _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7),
        _mm256_set_epi32(6, 7, 0, 1, 2, 3, 4, 5),
        _mm256_set_epi32(4, 5, 6, 7, 0, 1, 2, 3),
        _mm256_set_epi32(2, 3, 4, 5, 6, 7, 0, 1)
    };
    return _mm256_permutevar8x32_epi32(x, rotspec[n]);
}

Autres conseils

You can rotate right with AVX as follows. Assuming your input is x:

__m256d t0 = _mm256_permute_pd(x, 0x05);            // [x2  x3  x0  x1]
__m256d t1 = _mm256_permute2f128_pd(t0, t0, 0x01);  // [x0  x1  x2  x3]
__m256d y  = _mm256_blend_pd(t0, t1, 0x0a);         // [x0  x3  x2  x1]

The result is in y. By inverting the blend mask you can rotate left:

__m256d t0 = _mm256_permute_pd(x, 0x05);            // [x2  x3  x0  x1]
__m256d t1 = _mm256_permute2f128_pd(t0, t0, 0x01);  // [x0  x1  x2  x3]
__m256d y  = _mm256_blend_pd(t0, t1, 0x05);         // [x2  x1  x0  x3]

If you are limited to AVX instructions, you can still use the conditional blend instruction (VBLENDVPD) to select the correct rotation without using a switch. This is probably faster, especially if the condition cannot be easily predicted.

The full implementation of the right rotation (tested):

// rotate packed double vector right by n
__m256d rotate_pd_right(__m256d x, int n) {
    __m128i c = _mm_cvtsi32_si128(n);
    __m128i cc = _mm_unpacklo_epi64(c,c);

    // create blend masks (highest bit)
    __m128d half_low = _mm_castsi128_pd(_mm_slli_epi64(cc, 63));
    __m128d swap_low = _mm_castsi128_pd(_mm_slli_epi64(cc, 62));
    __m256d half = _mm256_insertf128_pd(_mm256_castpd128_pd256(half_low), half_low, 1);
    __m256d swap = _mm256_insertf128_pd(_mm256_castpd128_pd256(swap_low), swap_low, 1);

    // compute rotations
    __m256d t0 = _mm256_permute_pd(x, 0x05);            // [2 3 0 1]
    __m256d t1 = _mm256_permute2f128_pd(t0, t0, 0x01);  // [1 0 2 3]

    __m256d y0 = x;                                     // [3 2 1 0]
    __m256d y1 = _mm256_blend_pd(t0, t1, 0x0a);         // [0 3 2 1]
    __m256d y2 = _mm256_permute2f128_pd(x, x, 0x01);    // [1 0 3 2]
    __m256d y3 = _mm256_blend_pd(t0, t1, 0x05);         // [2 1 0 3]

    // select correct rotation
    __m256d y01 = _mm256_blendv_pd(y0, y1, half);
    __m256d y23 = _mm256_blendv_pd(y2, y3, half);
    __m256d yn  = _mm256_blendv_pd(y01, y23, swap);

    return yn;
}

Left rotation can be done simply as

__m256d rotate_pd_left(__m256d x, int n) {
    return rotate_pd_right(x, -n);
}

There are four rotates: 0-bits, 64-bits, 128-bits, and 192-bits. 0-bits is trivial. Felix Whyss's solution is fine for 64-bits and 192-bits for AVX. But for 128-bits rotates you can simply swap the high and low 128bit words. This is the best solution for AVX and AVX2.

_mm256_permute2f128_pd(x, x, 0x01)

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top