Store four 16-bit integers with SSE intrinsics
21-12-2019
Question
I multiply and round four 32-bit floats, then convert them to four 16-bit integers with SSE intrinsics. I'd like to store the four integer results to an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However, I haven't found any instruction to do this with 16-bit (__m64) integers.
void process(float *fptr, int16_t *sptr, __m128 factor)
{
    __m128 a = _mm_load_ps(fptr);
    __m128 b = _mm_mul_ps(a, factor);
    __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
    __m64 s = _mm_cvtps_pi16(c);
    // now store the values to sptr
}
Any help would be appreciated.
Solution
Personally, I would avoid using MMX. Also, I would use an explicit store rather than an implicit one, which often only works on certain compilers. The following code works fine in MSVC2012 with SSE 4.1.
Note that fptr needs to be 16-byte aligned. This is usually not a problem if you compile in 64-bit mode, but in 32-bit mode you should make sure it's aligned.
#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>
void process(float *fptr, int16_t *sptr, __m128 factor)
{
    __m128 a = _mm_load_ps(fptr);          /* aligned load of four floats */
    __m128 b = _mm_mul_ps(a, factor);
    __m128i c = _mm_cvttps_epi32(b);       /* float -> int32, truncating toward zero */
    __m128i d = _mm_packs_epi32(c, c);     /* int32 -> int16 with signed saturation */
    _mm_storel_epi64((__m128i*)sptr, d);   /* store the low 64 bits (four int16_t) */
}
int main() {
    float x[] = {1.0, 2.0, 3.0, 4.0};
    int16_t y[4];
    __m128 factor = _mm_set1_ps(3.14159f);
    process(x, y, factor);
    printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);
}
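The alignment requirement on fptr mentioned above can be met by allocating the buffer explicitly on a 16-byte boundary. The following is only a sketch: alloc_aligned_floats and is_aligned16 are hypothetical helper names, and it assumes a compiler whose xmmintrin.h provides _mm_malloc and _mm_free (GCC and Clang do).

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE; also declares _mm_malloc / _mm_free on GCC and Clang */

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   The result must be released with _mm_free(), not free(). */
float *alloc_aligned_floats(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Hypothetical helper: nonzero if p may be passed to _mm_load_ps. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

If aligning the buffer is not practical, _mm_loadu_ps can replace _mm_load_ps at the cost of giving up the alignment guarantee.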
Note that _mm_cvtps_pi16 is not a simple intrinsic. The Intel Intrinsics Guide says: "This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic."
Here is the assembly output of the MMX version:
mulps (%rdi), %xmm0
roundps $0, %xmm0, %xmm0
movaps %xmm0, %xmm1
cvtps2pi %xmm0, %mm0
movhlps %xmm0, %xmm1
cvtps2pi %xmm1, %mm1
packssdw %mm1, %mm0
movq %mm0, (%rsi)
ret
Here is the assembly output of the SSE-only version:
mulps (%rdi), %xmm0
cvttps2dq %xmm0, %xmm0
packssdw %xmm0, %xmm0
movq %xmm0, (%rsi)
ret
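One difference from the question is worth noting: _mm_cvttps_epi32 truncates toward zero, while the question rounds to nearest. A possible variant that rounds is sketched below. It assumes the MXCSR rounding mode is left at its default (round to nearest, ties to even); process_rounded is a hypothetical name, and the unaligned load is an illustrative choice that removes the alignment requirement on fptr.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 is sufficient for this variant */

/* Hypothetical rounding variant of process(). */
void process_rounded(const float *fptr, int16_t *sptr, __m128 factor)
{
    __m128  a = _mm_loadu_ps(fptr);        /* unaligned load: fptr need not be 16-byte aligned */
    __m128  b = _mm_mul_ps(a, factor);
    __m128i c = _mm_cvtps_epi32(b);        /* float -> int32 using the current MXCSR rounding mode */
    __m128i d = _mm_packs_epi32(c, c);     /* int32 -> int16 with signed saturation */
    _mm_storel_epi64((__m128i *)sptr, d);  /* store the low 64 bits (four int16_t) */
}
```

With the inputs 1, 2, 3, 4 and a factor of 3.14159f, this gives 3 6 9 13, whereas the truncating version gives 3 6 9 12.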
Other tips
With __m64 types, you can just cast the destination pointer appropriately:
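void process(float *fptr, int16_t *sptr, __m128 factor)
{
    __m128 a = _mm_load_ps(fptr);
    __m128 b = _mm_mul_ps(a, factor);
    __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
    __m64 s = _mm_cvtps_pi16(c);
    *((__m64 *) sptr) = s;   /* plain 64-bit store through a cast pointer */
    _mm_empty();             /* clear MMX state (EMMS) before any later x87 floating-point code */
}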
There is no distinction between aligned and unaligned stores with MMX instructions like there is with SSE/AVX; therefore, you don't need the intrinsics to perform a store.
I think you're safe moving that to a general 64-bit register (long long will work for both Linux LP64 and Windows LLP64) and copying it yourself.
From what I read in xmmintrin.h, gcc will handle the cast from __m64 to a long long perfectly fine.
To be sure, you can use _mm_cvtsi64_si64x.
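int16_t *f = sptr;                     /* destination array */
long long b = _mm_cvtsi64_si64x(s);    /* move the __m64 into a 64-bit integer */
/* element 0 occupies the lowest 16 bits, element 3 the highest */
f[0] = (int16_t)(b & 0xFFFFLL);
f[1] = (int16_t)((b >> 16) & 0xFFFFLL);
f[2] = (int16_t)((b >> 32) & 0xFFFFLL);
f[3] = (int16_t)((b >> 48) & 0xFFFFLL);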
You could type-pun that with a union to make it look better, but I guess that would fall into undefined behavior (it does in C++; C allows reading a union member other than the one last written).
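One way to sidestep the union question is memcpy, which is well defined in both C and C++ and which optimizing compilers typically reduce to a single 64-bit move. The sketch below assumes a little-endian target such as x86, so that the lowest 16 bits land in element 0; store_4x16 is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write the four 16-bit lanes packed in a 64-bit
   integer to dst. On a little-endian target, dst[0] receives bits 0..15. */
void store_4x16(int16_t *dst, long long packed)
{
    memcpy(dst, &packed, sizeof packed);   /* 8 bytes = four int16_t */
}
```

In the shift-and-mask code above, the four assignments could then be replaced by store_4x16(f, b).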