Store four 16bit integers with SSE intrinsics

https://stackoverflow.com//questions/22041999

21-12-2019

Question

I multiply and round four 32-bit floats, then convert them to four 16-bit integers with SSE intrinsics. I'd like to store the four integer results in an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However, I haven't found any instruction that does this for 16-bit integers held in an __m64.

void process(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_load_ps(fptr);
  __m128 b = _mm_mul_ps(a, factor);
  __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
  __m64 s =_mm_cvtps_pi16(c);
  // now store the values to sptr
}

Any help would be appreciated.

Solution

Personally, I would avoid MMX. I would also use an explicit store intrinsic rather than an implicit store, which often works only on certain compilers. The following code works fine in MSVC 2012 with SSE 4.1.

Note that _mm_load_ps requires fptr to be 16-byte aligned. This is usually not a problem if you compile in 64-bit mode, but in 32-bit mode you should make sure it is aligned.

#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>

void process(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_load_ps(fptr);
  __m128 b = _mm_mul_ps(a, factor);
  __m128i c = _mm_cvttps_epi32(b);
  __m128i d = _mm_packs_epi32(c,c);
  _mm_storel_epi64((__m128i*)sptr, d);
}

int main() {
    float x[] = {1.0, 2.0, 3.0, 4.0};
    int16_t y[4];
    __m128 factor = _mm_set1_ps(3.14159f);
    process(x, y, factor);
    printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);
}
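
The alignment requirement noted above can be handled in two ways, sketched below under a few assumptions: request the alignment explicitly with C11 alignas, or switch to the unaligned load _mm_loadu_ps. The names is_aligned16 and process_unaligned are hypothetical helpers, not part of the original answer, and the conversion still truncates as in the listing above.

```c
#include <stdalign.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: nonzero when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
  return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical variant of process() that accepts any alignment. */
void process_unaligned(const float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_loadu_ps(fptr);          /* unaligned load, no alignment requirement */
  __m128 b = _mm_mul_ps(a, factor);
  __m128i c = _mm_cvttps_epi32(b);        /* truncating conversion, as in the answer */
  __m128i d = _mm_packs_epi32(c, c);      /* saturate 32-bit lanes to 16-bit */
  _mm_storel_epi64((__m128i *)sptr, d);   /* store the low 64 bits (four int16_t) */
}
```

If the aligned load is kept instead, declaring the input as alignas(16) float x[4] makes the alignment explicit rather than relying on the compiler's stack layout.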

Note that _mm_cvtps_pi16 is not a simple intrinsic. The Intel Intrinsics Guide says: "This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic."

Here is the assembly output for the MMX version:

mulps   (%rdi), %xmm0
roundps $0, %xmm0, %xmm0
movaps  %xmm0, %xmm1
cvtps2pi    %xmm0, %mm0
movhlps %xmm0, %xmm1
cvtps2pi    %xmm1, %mm1
packssdw    %mm1, %mm0
movq    %mm0, (%rsi)
ret

Here is the assembly output for the SSE-only version:

mulps   (%rdi), %xmm0
cvttps2dq   %xmm0, %xmm0
packssdw    %xmm0, %xmm0
movq    %xmm0, (%rsi)
ret
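
One caveat about the SSE-only listing: cvttps2dq (_mm_cvttps_epi32) truncates toward zero, whereas the question asked for rounding. A minimal sketch of a rounding variant follows. It assumes the MXCSR rounding mode is still the default (round to nearest, ties to even), and process_rounded is a hypothetical name. It also uses _mm_loadu_ps so the sketch does not depend on the caller's alignment.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 is enough; no SSE4.1 needed */

/* Hypothetical rounding variant of process(). */
void process_rounded(const float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_loadu_ps(fptr);          /* unaligned load */
  __m128 b = _mm_mul_ps(a, factor);
  __m128i c = _mm_cvtps_epi32(b);         /* rounds per MXCSR (nearest-even by default) */
  __m128i d = _mm_packs_epi32(c, c);      /* signed saturation to the int16_t range */
  _mm_storel_epi64((__m128i *)sptr, d);   /* store the low 64 bits (four int16_t) */
}
```

With factor 3.14159f, the input 4.0f gives 13 here rather than the 12 produced by the truncating version, and a product above 32767 is clamped to 32767 by the saturating pack.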

Other tips

With __m64 types, you can just cast the destination pointer appropriately:

void process(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_load_ps(fptr);
  __m128 b = _mm_mul_ps(a, factor);
  __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
  __m64 s =_mm_cvtps_pi16(c);
  *((__m64 *) sptr) = s;
}

There is no distinction between aligned and unaligned stores with MMX instructions like there is with SSE/AVX; therefore, you don't need the intrinsics to perform a store.
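
A sketch of the same MMX path with one addition: MMX registers alias the x87 floating-point stack, so _mm_empty() (EMMS) should run before any later x87 floating-point code. The _mm_round_ps step is left out here so the sketch builds with plain SSE. That relies on cvtps2pi rounding according to the current MXCSR mode, and the sketch assumes the default (round to nearest). process_mmx is a hypothetical name.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE, includes the MMX conversion intrinsics */

/* Hypothetical MMX-based variant of process(). */
void process_mmx(const float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_loadu_ps(fptr);   /* unaligned load */
  __m128 b = _mm_mul_ps(a, factor);
  __m64 s = _mm_cvtps_pi16(b);     /* convert using the MXCSR rounding mode, pack with saturation */
  *((__m64 *)sptr) = s;            /* plain 64-bit store through a cast pointer */
  _mm_empty();                     /* leave MMX state before any x87 floating-point use */
}
```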

I think you're safe moving that to a general-purpose 64-bit register (long long is 64 bits under both Linux LP64 and Windows LLP64) and copying it out yourself.

From what I read in xmmintrin.h, gcc will handle the cast perfectly fine from __m64 to a long long. To be sure, you can use _mm_cvtsi64_si64x.

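int16_t *f = sptr;                      /* output array from the question */
long long b = _mm_cvtsi64_si64x(s);
/* x86 is little-endian: element 0 lives in the low 16 bits */
f[0] = (int16_t)(b & 0xFFFF);
f[1] = (int16_t)((b >> 16) & 0xFFFF);
f[2] = (int16_t)((b >> 32) & 0xFFFF);
f[3] = (int16_t)((b >> 48) & 0xFFFF);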

You could type-pun that with a union to make it look better, but I believe that is undefined behavior in C++ (C permits reading a union member other than the one last written).
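
For comparison, here is a minimal sketch of the lane extraction without any type punning. It assumes lane 0 occupies the low 16 bits of the 64-bit value, which is how x86 lays out the packed result. unpack_lanes and unpack_lanes_memcpy are hypothetical helpers.

```c
#include <stdint.h>
#include <string.h>

/* Shift-based extraction: lane i occupies bits [16*i, 16*i + 15]. */
void unpack_lanes(uint64_t bits, int16_t out[4])
{
  for (int i = 0; i < 4; ++i)
    out[i] = (int16_t)(uint16_t)(bits >> (16 * i));
}

/* memcpy-based extraction: copies the raw bytes, so the lane order
   matches the shift version only on little-endian targets such as x86. */
void unpack_lanes_memcpy(uint64_t bits, int16_t out[4])
{
  memcpy(out, &bits, sizeof bits);
}
```

The shift version does not depend on byte order, while the memcpy version is the usual well-defined substitute for a union or pointer cast when the target's byte order is known.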

License: CC-BY-SA with attribution

Not affiliated with StackOverflow