Shifting SSE/AVX registers 32 bits left and right while shifting in zeros

https://stackoverflow.com/questions/19516585

01-07-2022
|

Question

I want to shift SSE/AVX registers multiples of 32 bits left or right while shifting in zeros.

Let me be more precise on the shifts I'm interested in. For SSE I want to do the following shifts of four 32bit floats:

shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2]

For AVX I want to shift do the following shifts:

shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7]
shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6]
shift3_AVX: [1, 2, 3, 4 ,5 ,6, 7, 8] -> [0, 0, 0, 0, 1, 2, 3, 4]

For SSE I have come up with the following code

shift1_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)); 
shift2_SSE = _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
//shift2_SSE = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));

Is there a better way to do this with SSE?

For AVX I have come up with the following code which needs AVX2 (and it's untested). Edit (as explained by Paul R this code won't work).

shift1_AVX2 =_mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4)));
shift2_AVX2 =_mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 8)));
shift3_AVX2 =_mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 12)));

How can I do this best with AVX not AVX2 (for example with _mm256_permute or _mm256_shuffle`)? Is there a better way to do this with AVX2?

Edit:

Paul R has informed me that my AVX2 code won't work and that AVX code is probably not worth it. Instead for AVX2 I should use _mm256_permutevar8x32_ps along with _mm256_and_ps. I don't have a system with AVX2 (Haswell) so this is hard to test.

Edit: Based on Felix Wyss's answer I came up with some solutions for AVX which only needs 3 intrisnics for shift1_AVX and shift2_AVX and only one intrinsic for shift3_AVX. This is due to the fact that _mm256_permutef128Ps has a zeroing feature.

shift1_AVX

__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));       
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);          
__m256 y = _mm256_blend_ps(t0, t1, 0x11);

shift2_AVX

__m256 t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 41);
__m256 y = _mm256_blend_ps(t0, t1, 0x33);

shift3_AVX

x = _mm256_permute2f128_ps(x, x, 41);

La solution 2

Your SSE implementation is fine but I suggest you use the _mm_slli_si128 implementation for both of the shifts - the casts make it look complicated but it really boils down to just one instruction for each shift.

Your AVX2 implementation won't work unfortunately. Almost all AVX instructions are effectively just two SSE instructions in parallel operating on two adjacent 128 bit lanes. So for your first shift_AVX2 example you'd get:

0, 0, 1, 2, 0, 4, 5, 6
----------- ----------
 LS lane     MS lane

All is not lost however: one of the few instructions which does work across lanes on AVX is _mm256_permutevar8x32_ps. Note that you'll need to use an _mm256_and_ps in conjunction with this to zero the shifted in elements. Note also that this is an AVX2 solution — AVX on its own is very limited for anything other than basic arithmetic/logic operations so I think you'll have a hard time doing this efficiently without AVX2.

Autres conseils

You can do a shift right with _mm256_permute_ps, _mm256_permute2f128_ps, and _mm256_blend_ps as follows:

__m256 t0 = _mm256_permute_ps(x, 0x39);            // [x4  x7  x6  x5  x0  x3  x2  x1]
__m256 t1 = _mm256_permute2f128_ps(t0, t0, 0x81);  // [ 0   0   0   0  x4  x7  x6  x5] 
__m256 y  = _mm256_blend_ps(t0, t1, 0x88);         // [ 0  x7  x6  x5  x4  x3  x2  x1]

The result is in y. In order to do a rotate right, set the permute mask to 0x01 instead of 0x81. Shift/rotate left and larger shifts/rotates can be done similarly by changing the permute and blend control bytes.

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow