SSE multiplication 16 x uint8_t

https://stackoverflow.com/questions/8193601

04-03-2021
|

Question

I want to multiply with SSE4 a __m128i object with 16 unsigned 8 bit integers, but I could only find an intrinsic for multiplying 16 bit integers. Is there nothing such as _mm_mult_epi8?

Solution

There is no 8-bit multiplication in MMX/SSE/AVX. However, you can emulate 8-bit multiplication intrinsic using 16-bit multiplication as follows:

inline __m128i _mm_mullo_epi8(__m128i a, __m128i b)
{
    __m128i zero = _mm_setzero_si128();
    __m128i Alo = _mm_cvtepu8_epi16(a);
    __m128i Ahi = _mm_unpackhi_epi8(a, zero);
    __m128i Blo = _mm_cvtepu8_epi16(b);
    __m128i Bhi = _mm_unpackhi_epi8(b, zero);
    __m128i Clo = _mm_mullo_epi16(Alo, Blo);
    __m128i Chi = _mm_mullo_epi16(Ahi, Bhi);
    __m128i maskLo = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 14, 12, 10, 8, 6, 4, 2, 0);
    __m128i maskHi = _mm_set_epi8(14, 12, 10, 8, 6, 4, 2, 0, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
    __m128i C = _mm_or_si128(_mm_shuffle_epi8(Clo, maskLo), _mm_shuffle_epi8(Chi, maskHi));

     return C;
}

OTHER TIPS

A (potentially) faster way than Marat's solution based on Agner Fog's solution:

Instead of splitting hi/low, split odd/even. This has the added benefit that it works with pure SSE2 instead of requiring SSE4.1 (of no use to the OP, but a nice added bonus for some). I also added an optimization if you have AVX2. Technically the AVX2 optimization works with only SSE2 intrinsics, but it's slower than the shift left then right solution.

__m128i mullo_epi8(__m128i a, __m128i b)
{
    // unpack and multiply
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8),_mm_srli_epi16(b, 8));
    // repack
#ifdef __AVX2__
    // only faster if have access to VPBROADCASTW
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
#else
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_srli_epi16(_mm_slli_epi16(dst_even,8), 8));
#endif
}

Agner uses the blendv_epi8 intrinsic with SSE4.1 support.

Edit:

Interestingly, after doing more disassembly work (with optimized builds), at least my two implementations get compiled to exactly the same thing. Example disassembly targeting "ivy-bridge" (AVX).

vpmullw xmm2,xmm0,xmm1
vpsrlw xmm0,xmm0,0x8
vpsrlw xmm1,xmm1,0x8
vpmullw xmm0,xmm0,xmm1
vpsllw xmm0,xmm0,0x8
vpand xmm1,xmm2,XMMWORD PTR [rip+0x281]
vpor xmm0,xmm0,xmm1

It uses the "AVX2-optimized" version with a pre-compiled 128-bit xmm constant. Compiling with only SSE2 support produces a similar results (though using SSE2 instructions). I suspect Agner Fog's original solution might get optimized to the same thing (would be crazy if it didn't). No idea how Marat's original solution compares in an optimized build, though for me having a single method for all x86 simd extensions newer than and including SSE2 is quite nice.

The only 8 bit SSE multiply instruction is PMADDUBSW (SSSE3 and later, C/C++ intrinsic: _mm_maddubs_epi16). This multiplies 16 x 8 bit unsigned values by 16 x 8 bit signed values and then sums adjacent pairs to give 8 x 16 bit signed results. If you can't use this rather specialised instruction then you'll need to unpack to pairs of 16 bit vectors and use regular 16 bit multiply instructions. Obviously this implies at least a 2x throughput hit so use the 8 bit multiply if you possibly can.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow