Question

The _mm_set_epi64 and similar *_epi64 instructions seem to use and depend on __m64 types. I want to initialize a variable of type __m128 such that the upper 64 bits of it are 0, and the lower 64 bits of it are set to x, where x is of type uint64_t (or similar unsigned 64-bit type). What's the "right" way of doing so?

Preferably, this should be done in a compiler-independent manner.

Solution

To answer your question about how to load a 64-bit value into the lower 64 bits of an XMM register while zeroing the upper 64 bits: _mm_loadl_epi64((__m128i const*)&x) will do exactly what you want. The intrinsic takes a __m128i const*, hence the cast on &x.
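As a minimal sketch of the load described above, assuming an SSE2-capable compiler that provides <emmintrin.h>: the helper names load_low64_zero_high and store_lanes are hypothetical and only serve the illustration.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: put x in the low 64 bits and zero the high 64 bits.
static inline __m128i load_low64_zero_high(uint64_t x)
{
    // _mm_loadl_epi64 takes a __m128i const*, so the address of x is cast.
    // Only the 8 bytes of x are read; the upper half of the result is zeroed.
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&x));
}

// Hypothetical helper for inspection: copy the vector out as two 64-bit lanes.
// out[0] receives the low 64 bits, out[1] the high 64 bits.
static inline void store_lanes(__m128i v, uint64_t out[2])
{
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Calling store_lanes on the result of load_low64_zero_high(x) should leave x in element 0 and zero in element 1.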

Regarding _mm_set_epi64: I once said that looking at the source code of Agner Fog's Vector Class Library can answer 95% of the questions about SSE/AVX on SO. Agner implemented this (in the file vectori128.h) for multiple compilers, in both 64-bit and 32-bit mode. Note that for the 32-bit MSVC solution Agner says "this is inefficient, but other solutions are worse". I guess that's what Mysticial means by "There isn't a good way to do it."

Vec2q(int64_t i0, int64_t i1) {
#if defined (_MSC_VER) && ! defined(__INTEL_COMPILER)
        // MS compiler has no _mm_set_epi64x in 32 bit mode
#if defined(__x86_64__)                                    // 64 bit mode (VCL's instructionset.h defines __x86_64__ for MSVC x64 builds)
#if _MSC_VER < 1700
        __m128i x0 = _mm_cvtsi64_si128(i0);                // 64 bit load
        __m128i x1 = _mm_cvtsi64_si128(i1);                // 64 bit load
        xmm = _mm_unpacklo_epi64(x0,x1);                   // combine
#else
        xmm = _mm_set_epi64x(i1, i0);
#endif
#else   // MS compiler in 32-bit mode
        union {
            int64_t q[2];
            int32_t r[4];
        } u;
        u.q[0] = i0;  u.q[1] = i1;
        // this is inefficient, but other solutions are worse
        xmm = _mm_setr_epi32(u.r[0], u.r[1], u.r[2], u.r[3]);
#endif  // __x86_64__
#else   // Other compilers
        xmm = _mm_set_epi64x(i1, i0);
#endif
};

OTHER TIPS

The most common "standard" intrinsic for this is _mm_set_epi64x.

For platforms that lack _mm_set_epi64x you can define a replacement macro like this:

#define _mm_set_epi64x(m0, m1) _mm_set_epi64(_m_from_int64(m0), _m_from_int64(m1))

"I want to initialize a variable of type __m128 ... where x is of type uint64_t"

The intrinsic that takes plain 64-bit integers (so a uint64_t can be passed, with a cast if needed) is _mm_set_epi64x, as opposed to _mm_set_epi64, which takes __m64 arguments.
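A minimal sketch of that approach, assuming a compiler that actually provides _mm_set_epi64x (the first answer notes that MSVC lacks it in 32-bit mode). The helper name set_low64 is hypothetical.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: x in the low 64 bits, zero in the high 64 bits.
static inline __m128i set_low64(uint64_t x)
{
    // The arguments are ordered (high, low). The intrinsic takes signed
    // 64-bit integers, so the unsigned value is cast; the bit pattern is kept
    // on the two's-complement targets that SSE2 runs on.
    return _mm_set_epi64x(0, static_cast<long long>(x));
}
```

Because the value goes through an integer argument rather than a pointer, the compiler is free to pick whichever instruction sequence suits the target.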

I recently ran into this issue on Solaris. Sun Studio 12.3 and earlier lack _mm_set_epi64x. They also lack the workarounds, like _mm_cvtsi64_si128 and _m_from_int64.

Here's the hack I used, if you're interested. The other option was to disable SSE2, which was not appealing (it was 3x slower in benchmarks):

// Sun Studio 12.3 and earlier lack SSE2's _mm_set_epi64 and _mm_set_epi64x.
#if defined(__SUNPRO_CC) && (__SUNPRO_CC < 0x5130)
inline __m128i _mm_set_epi64x(const uint64_t a, const uint64_t b)
{
    union INT_128_64 {
        __m128i   v128;
        uint64_t  v64[2];
    };

    INT_128_64 v;
    v.v64[0] = b; v.v64[1] = a;  // b is the low 64 bits, a is the high 64 bits
    return v.v128;
}
#endif

I believe C++11 could do additional things to help the compiler and performance, such as initializing a const union directly:

const INT_128_64 v = {a,b};
return v.v128;

There's a big caveat: I believe this is undefined behavior, because the write goes through the v64 member of the union and the read then goes through the v128 member. Testing under SunCC shows the compiler doing the expected (but technically incorrect) thing.

I believe you can sidestep the undefined behavior using a memcpy, but that could crush performance. Also see Peter Cordes' answer and discussion at "How to swap two __m128i variables in C++03 given its an opaque type and an array?".

The following may also be a good choice to avoid the undefined behavior from using the inactive union member. But I'm not sure about the punning.

INT_128_64 v;
v.v64[0] = b; v.v64[1] = a;
return *(reinterpret_cast<__m128i*>(v.v64));

EDIT (three months later): Solaris and SunCC did not like the punning. It produced bad code for us, and we had to memcpy the value into __m128i. Unix, Linux, Windows, GCC, Clang, ICC, MSC were all OK. Only SunCC gave us trouble.
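For reference, the memcpy fallback mentioned above might look roughly like this. It is a sketch under assumptions, not the exact code we shipped: the name set_epi64x_memcpy is hypothetical (chosen so it cannot clash with a real _mm_set_epi64x), and it relies on __m128i being exactly 16 bytes.

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical stand-in for _mm_set_epi64x(hi, lo) that avoids both union
// punning and pointer punning by copying the bytes into the vector.
static inline __m128i set_epi64x_memcpy(uint64_t hi, uint64_t lo)
{
    // x86 is little-endian, so element 0 becomes the low 64 bits.
    const uint64_t lanes[2] = { lo, hi };
    __m128i v;
    std::memcpy(&v, lanes, sizeof(v));  // assumes sizeof(__m128i) == 16
    return v;
}
```

Whether the copy is optimized away depends on the compiler, so it is worth checking the generated code on the toolchain in question.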

Licensed under: CC-BY-SA with attribution