SSE intrinsics cause normal float operation to return -1.#INV

https://stackoverflow.com/questions/9052551

04-12-2019

Question

I am having a problem with an SSE method I am writing that performs audio processing. I have implemented an SSE random function based on Intel's paper here:

http://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/

I also have a method that converts from float to S16, also using SSE. The conversion is performed quite simply as follows:

unsigned int Float_S16LE(float *data, const unsigned int samples, uint8_t *dest)
{
  int16_t *dst = (int16_t*)dest;
  const __m128 mul = _mm_set_ps1((float)INT16_MAX);
  __m128 rand;
  /* round the sample count down to a multiple of 4 */
  const uint32_t even = samples & ~0x3;
  for(uint32_t i = 0; i < even; i += 4, data += 4, dst += 4)
  {
    /* random round to dither */
    FloatRand4(-0.5f, 0.5f, NULL, &rand);

    __m128 rmul = _mm_add_ps(mul, rand);
    __m128 in = _mm_mul_ps(_mm_load_ps(data),rmul);
    __m64 con = _mm_cvtps_pi16(in);

    memcpy(dst, &con, sizeof(int16_t) * 4);
  }
  return even; /* assumed: the snippet as posted had no return statement */
}

FloatRand4 is defined as follows:

static inline void FloatRand4(const float min, const float max, float result[4], __m128 *sseresult = NULL)
{
  const float delta  = (max - min) / 2.0f;
  const float factor = delta / (float)INT32_MAX;
  ...
}

If sseresult != NULL, the __m128 result is returned through it and result is unused. This works perfectly on the first loop iteration, but on the next iteration delta becomes -1.#INF instead of 1.0. If I comment out the line __m64 con = _mm_cvtps_pi16(in); the problem goes away.

I think that the FPU is getting into an unknown state or something.

Solution

SSE intrinsics that produce or consume an __m64 (such as _mm_cvtps_pi16) operate on the MMX registers, which are aliased onto the x87 FPU registers. Mixing them with regular floating-point math can therefore produce weird results. If you use:

_mm_empty()

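after the last __m64 operation and before the next floating-point operation, the FPU is reset to a correct state (the intrinsic emits the EMMS instruction). Microsoft has "Guidelines for When to Use EMMS".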
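
A minimal sketch of where the call could go, assuming a four-sample conversion without the dither step. The helper name convert4_to_s16 is hypothetical and not part of the question's code; only the intrinsics themselves are real.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, including the __m64-based _mm_cvtps_pi16
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the question): scales 4 floats by INT16_MAX
// and converts them to signed 16-bit samples. Dither is omitted.
static void convert4_to_s16(const float *src, int16_t *dst)
{
    const __m128 scale = _mm_set_ps1((float)INT16_MAX);
    const __m128 in = _mm_mul_ps(_mm_loadu_ps(src), scale);

    // The result is an __m64, so it lives in an MMX register that
    // shares state with the x87 FPU register stack.
    const __m64 packed = _mm_cvtps_pi16(in);
    std::memcpy(dst, &packed, sizeof(packed));

    // Clear the MMX state so that later x87 floating-point code
    // (for example the scalar math in FloatRand4) sees a valid FPU.
    _mm_empty();
}
```

In the question's loop, FloatRand4 does scalar float math on every iteration, so under this sketch's assumptions the call would have to run inside the loop body, after the memcpy, rather than once after the loop.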

Other tips

_mm_load_ps requires a 16-byte-aligned address, and nothing here guarantees that: float* data may be aligned to only 4 bytes. Use _mm_loadu_ps instead.
memcpy will probably kill the advantages achieved with SSE. You should use a store intrinsic for the __m64, but here again, take care of the alignment. If it's impossible to do an unaligned stream or store of an __m64, I'd either keep the value inside an __m128i and do a masked write with _mm_maskmoveu_si128, or store those 8 bytes by hand.

http://msdn.microsoft.com/en-us/library/bytwczae.aspx
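
The tips above could be combined roughly as follows. This is a sketch under stated assumptions, not the answerer's code: it assumes SSE2 is available, leaves out the dither step, uses the hypothetical name float_to_s16_sse2, and swaps the masked write for _mm_storel_epi64, which stores the low 8 bytes without an alignment requirement. Because no __m64 is involved, no _mm_empty() call is needed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)
#include <cstddef>
#include <cstdint>

// Hypothetical SSE2-only variant (not from the question or answers).
// Converts samples in groups of 4 and returns how many were converted;
// any remaining 1-3 samples are left for the caller to handle.
static std::size_t float_to_s16_sse2(const float *src, std::size_t samples, int16_t *dst)
{
    const __m128 scale = _mm_set1_ps((float)INT16_MAX);
    const std::size_t even = samples & ~(std::size_t)3;

    for (std::size_t i = 0; i < even; i += 4) {
        // Unaligned load: src only needs the natural 4-byte float alignment.
        const __m128 in = _mm_mul_ps(_mm_loadu_ps(src + i), scale);

        // Convert to 32-bit integers using the current rounding mode,
        // then pack to 16 bits with signed saturation.
        const __m128i i32 = _mm_cvtps_epi32(in);
        const __m128i i16 = _mm_packs_epi32(i32, i32);

        // Store only the low 64 bits (4 samples); no alignment required.
        _mm_storel_epi64(reinterpret_cast<__m128i *>(dst + i), i16);
    }
    return even;
}
```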

Licensed under: CC-BY-SA with attribution
