Question

I have borrowed a matrix inversion algorithm from the Intel website: http://download.intel.com/design/PentiumIII/sml/24504301.pdf

It uses _mm_loadh_pi and _mm_loadl_pi to load the 4x4 matrix coefficients and do part of the shuffling at the same time. The performance improvement in my app is significant; if I instead do a classic load/shuffle of the matrix using _mm_load_ps, it's slightly slower.

But this load approach produces a compilation warning: "tmp1 is used uninitialized in this function"

__m128 tmp1;
tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));

That makes sense in a way, since tmp1 is an input parameter of _mm_loadl_pi and so affects the result.

However, looking in detail at what the code does shows that tmp1 needs no initialization, and initializing it measurably slows down the code.

Do you have any idea how to remove the warning, ideally in a portable way, without having to initialize tmp1?


Solution 2

I tried 3 compilers: MS Visual Studio 2012, gcc 4.8.1, and Intel icl 13.1. They all warn as you point out. I found that both gcc and MS automatically generate initialization code for tmp1, even as they warn about the lack of initialization. The MS compiler generates an undesirable memory access: movaps xmm0,xmmword ptr [rsp]. Gcc generates a more efficient xorps xmm0,xmm0. So in the case of gcc, adding tmp1 = _mm_setzero_ps() eliminates the warning and produces exactly the same code as without it. In the case of MS, adding tmp1 = _mm_setzero_ps() makes the code shorter and probably faster. Only the Intel compiler is smart enough to avoid the unneeded initialization.

Here is a possible workaround for the MS and gcc compilers:

    __m128 tmp1 = _mm_loadh_pi(_mm_load_ps (src), (__m64*)(src + 4));

Code generation is:

movaps      xmm0,xmmword ptr [rcx]
movhps      xmm0,qword ptr [rcx+10h]

It looks shorter, but should be benchmarked to make sure it is faster.
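One crude way to do that comparison is a timing loop along these lines (just a sketch: the iteration count, the volatile sink, and clock() as the timer are arbitrary choices; func3 stands for whichever variant is being measured and should be built in a separate file so the load isn't hoisted out of the loop):

#include <xmmintrin.h>
#include <stdio.h>
#include <time.h>

__m128 func3 (float *src);   /* any of the load variants, compiled in a separate file */

static void benchmark (void)
    {
    float f [8] = {1, 2, 3, 4, 5, 6, 7, 8};
    volatile float sink = 0;         /* keeps the result live so the loop isn't removed */
    clock_t start = clock ();
    long i;

    for (i = 0; i < 100000000L; i++)
        sink += _mm_cvtss_f32 (func3 (f));

    printf ("%.3f s (sink=%g)\n",
            (double)(clock () - start) / CLOCKS_PER_SEC, sink);
    }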

09/12/2013: test code for different warning suppression ideas:

#include <xmmintrin.h>
#include <stdint.h>
#include <stdio.h>

//---------------------------------------------------------------------------
// original code from http://download.intel.com/design/PentiumIII/sml/24504301.pdf
__m128 func1 (float *src)
    {
    __m128 tmp1;
    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4));
    return tmp1;
    }

//---------------------------------------------------------------------------
// original code plus tmp1 initialization
__m128 func2 (float *src)
    {
    __m128 tmp1 = _mm_loadh_pi(_mm_loadl_pi (_mm_setzero_ps (), (__m64*)(src)), (__m64*)(src + 4));
    return tmp1;
    }

//---------------------------------------------------------------------------
// use redundant load to eliminate warning 
__m128 func3 (float *src)
    {
    __m128 tmp1 = _mm_loadh_pi(_mm_load_ps (src), (__m64*)(src + 4));
    return tmp1;
    }

//---------------------------------------------------------------------------

static void dump (void *data)
    {
    float *f16 = data;
    int index;

    for (index = 0; index < 4; index++)
        printf ("%g ", f16 [index]);
    printf ("\n");
    }

//---------------------------------------------------------------------------

int main (void)
    {
    float f [8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m128 tmp;

    tmp = func1 (f);
    dump (&tmp);
    tmp = func2 (f);
    dump (&tmp);
    tmp = func3 (f);
    dump (&tmp);
    return 0;
    }

build commands:

gcc  -O3 -Wall -Wfatal-errors sample.c -osample.exe
objdump -Mintel --disassemble sample.exe > disasm.txt

cl -Ox -Zi -W4 sample.c
dumpbin -disasm -symbols sample.exe > disasm.txt

icl -Ox -Zi sample.c                                           
dumpbin -disasm -symbols sample.exe > disasm.txt                  

OTHER TIPS

This is what _mm_undefined_ps is for (but it only actually helps code-gen with Intel's compiler; other compilers usually treat it the same as _mm_setzero_ps).
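Applied to the original load pair it would look something like this (a minimal sketch; the function name is just illustrative, and as noted above most compilers other than ICC still emit a zeroing instruction for it, though it documents the intent and typically silences the warning):

#include <xmmintrin.h>

__m128 func1_undef(float *src) {
    __m128 tmp1 = _mm_undefined_ps();   // value is unspecified; no initialization required
    tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (const __m64*)src), (const __m64*)(src + 4));
    return tmp1;
}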

And besides that, you want a movsd load of two floats that zero-extends and breaks the false dependency on the old value of a register, not a movlps that merges. (Unless you're building for a crusty old 32-bit CPU that has SSE1 but not SSE2, like the PIII your code was originally written for.)

Cast to double * and use _mm_load_sd. You're not dereferencing the pointer yourself, only via _mm_load_sd, so I think this is still strict-aliasing safe; in any case it works in practice on current compilers. If it turns out to be unsafe, _mm_loadl_epi64 (movq) takes a __m128i const* arg (weird, because it only loads the low 64 bits, but __m128i is a may_alias type you can safely use to read any other type, like char*).

static inline
__m128 stride_gather(float *src) {
    __m128 tmp1 = _mm_castpd_ps(_mm_load_sd((const double*)src));  // movsd
    tmp1 = _mm_loadh_pi(tmp1, (const __m64*)(src+4));              // movhps
    return tmp1;
}

gcc7 and later use movq instead of movsd, which is weird but I think it's fine. At worst an extra cycle of bypass-delay latency as an input to movhps on some old CPUs, but not a throughput penalty.

The other 3 major compilers (clang/ICC/MSVC) all compile this to the expected movsd / movhps with no false dependency on the old value of xmm0. (source+asm output on the Godbolt compiler explorer.)
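For comparison, the _mm_loadl_epi64 fallback mentioned above might look like this (a sketch only; it needs SSE2 via <emmintrin.h>, and the function name is just illustrative):

#include <emmintrin.h>

static inline
__m128 stride_gather_sse2(const float *src) {
    // movq: loads the low 64 bits and zero-extends; __m128i const* is a may_alias type
    __m128 tmp1 = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i*)src));
    tmp1 = _mm_loadh_pi(tmp1, (const __m64*)(src + 4));                  // movhps
    return tmp1;
}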

Licensed under: CC-BY-SA with attribution