Question

I have two arrays, char* c and float* f, and I need to do this operation:

// Compute float mask
float* f;
char* c;
char c_thresh;
int n;

for ( int i = 0; i < n; ++i )
{
    if ( c[i] < c_thresh ) f[i] = 0.0f;
    else                   f[i] = 1.0f;
}

I am looking for a fast way to do it: without conditionals and using SSE (4.2 or AVX) if possible.

If using float instead of char can result in faster code, I can change my code to use floats only:

// Compute float mask
float* f;
float* c;
float c_thresh;
int n;

for ( int i = 0; i < n; ++i )
{
    if ( c[i] < c_thresh ) f[i] = 0.0f;
    else                   f[i] = 1.0f;
}

Thanks


The solution

Pretty easy: just do the comparison, sign-extend the byte mask to dwords, and AND with 1.0f (not tested, and this isn't meant to be copy&paste code anyway; it's meant to show how you do it):

movd xmm0, [c]          ; read 4 bytes from c
pcmpgtb xmm0, threshold ; compare (note: comparison is >, not >=, so adjust threshold)
pmovsxbd xmm0, xmm0     ; sign-extend mask bytes to dwords (zero-extension would clear the bits 1.0f needs)
pand xmm0, one          ; AND all four elements with 1.0f
movdqa [f], xmm0        ; save result

Should be pretty easy to convert to intrinsics.
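
For example, an intrinsics version of the same idea might look like this (a minimal sketch, untested like the asm above; it assumes SSE4.1 for pmovsxbd, signed char, n a multiple of 4, and c_thresh > CHAR_MIN so the threshold adjustment doesn't wrap; the name mask4 is just for illustration):

#include <immintrin.h>
#include <cstring>

void mask4(float* f, const char* c, char c_thresh, int n)
{
    const __m128i thresh = _mm_set1_epi8((char)(c_thresh - 1)); // turn >= into >
    const __m128i one    = _mm_set1_epi32(0x3f800000);          // bit pattern of 1.0f
    for (int i = 0; i < n; i += 4)
    {
        int bytes;
        std::memcpy(&bytes, c + i, 4);                 // read 4 bytes from c
        __m128i v = _mm_cvtsi32_si128(bytes);
        __m128i m = _mm_cmpgt_epi8(v, thresh);         // 0xFF where c[i] >= c_thresh
        m = _mm_cvtepi8_epi32(m);                      // sign-extend bytes to dwords (pmovsxbd)
        m = _mm_and_si128(m, one);                     // AND with 1.0f
        _mm_storeu_ps(f + i, _mm_castsi128_ps(m));     // save 4 floats
    }
}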

Other tips

The following code uses SSE2 (I think).

It performs 16 comparisons of bytes in one instruction (_mm_cmpgt_epi8). It assumes that char is signed; if your char is unsigned, it requires additional fiddling (flipping the most significant bit of each char).

The only non-obvious thing it does is use the magic number 0x3f80 to represent the floating-point constant 1.0. The full magic number is actually 0x3f800000, but the fact that its 16 least-significant bits are zero makes it possible to do the bit fiddling more efficiently (using 16-bit masks instead of 32-bit ones).

// load (assuming the pointer is aligned)
__m128i input = *(const __m128i*)c;
// compare
__m128i cmp = _mm_cmpgt_epi8(input, _mm_set1_epi8(c_thresh - 1));
// convert to 16-bit
__m128i c0 = _mm_unpacklo_epi8(cmp, cmp);
__m128i c1 = _mm_unpackhi_epi8(cmp, cmp);
// convert ffff to 3f80
c0 = _mm_and_si128(c0, _mm_set1_epi16(0x3f80));
c1 = _mm_and_si128(c1, _mm_set1_epi16(0x3f80));
// convert to 32-bit and write (assuming the pointer is aligned)
__m128i* result = (__m128i*)f;
result[0] = _mm_unpacklo_epi16(_mm_setzero_si128(), c0);
result[1] = _mm_unpackhi_epi16(_mm_setzero_si128(), c0);
result[2] = _mm_unpacklo_epi16(_mm_setzero_si128(), c1);
result[3] = _mm_unpackhi_epi16(_mm_setzero_si128(), c1);
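
To process a whole array rather than just 16 elements, the kernel above could be wrapped in a loop with a scalar tail, roughly like this (a sketch under the same assumptions: signed char, 16-byte-aligned pointers for the vector part, and c_thresh > CHAR_MIN):

#include <emmintrin.h>

void mask_sse2(float* f, const char* c, char c_thresh, int n)
{
    const __m128i thresh = _mm_set1_epi8((char)(c_thresh - 1));
    const __m128i k3f80  = _mm_set1_epi16(0x3f80);
    const __m128i zero   = _mm_setzero_si128();
    int i = 0;
    for (; i + 16 <= n; i += 16)
    {
        __m128i input = _mm_load_si128((const __m128i*)(c + i));
        __m128i cmp   = _mm_cmpgt_epi8(input, thresh);             // 0xFF where c[i] >= c_thresh
        __m128i c0    = _mm_and_si128(_mm_unpacklo_epi8(cmp, cmp), k3f80);
        __m128i c1    = _mm_and_si128(_mm_unpackhi_epi8(cmp, cmp), k3f80);
        __m128i* out  = (__m128i*)(f + i);
        out[0] = _mm_unpacklo_epi16(zero, c0);                     // 0x3f800000 == 1.0f
        out[1] = _mm_unpackhi_epi16(zero, c0);
        out[2] = _mm_unpacklo_epi16(zero, c1);
        out[3] = _mm_unpackhi_epi16(zero, c1);
    }
    for (; i < n; ++i)                                             // scalar tail
        f[i] = (c[i] < c_thresh) ? 0.0f : 1.0f;
}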

By switching to floats you can let GCC auto-vectorize the loop and not worry about intrinsics. The following code does what you want and auto-vectorizes:

void foo(float *f, float*c, float c_thresh, const int n) {
    for (int i = 0; i < n; ++i) {
        f[i] = (float)(c[i] >= c_thresh);
    }
}

Compiled with

g++ -O3 -Wall -pedantic -march=native main.cpp -ftree-vectorizer-verbose=1

You can see the results and edit/compile the code yourself at coliru. However, MSVC2013 did not vectorize the loop.

What about:

f[i] = (c[i] >= c_thresh);

At least this removes the conditional.

AVX version:

#include <immintrin.h>

void floatSelect(float* f, const char* c, size_t n, char c_thresh) {
    for (size_t i = 0; i < n; ++i) {
        if (c[i] < c_thresh) f[i] = 0.0f;
        else f[i] = 1.0f;
    }
}

void vecFloatSelect(float* f, const char* c, size_t n, char c_thresh) {
    const auto thresh = _mm_set1_epi8(c_thresh);
    const auto zeros = _mm256_setzero_ps();
    const auto ones = _mm256_set1_ps(1.0f);
    const auto shuffle0 = _mm_set_epi8(3, -1, -1, -1, 2, -1, -1, -1, 1, -1, -1, -1, 0, -1, -1, -1);
    const auto shuffle1 = _mm_set_epi8(7, -1, -1, -1, 6, -1, -1, -1, 5, -1, -1, -1, 4, -1, -1, -1);
    const auto shuffle2 = _mm_set_epi8(11, -1, -1, -1, 10, -1, -1, -1, 9, -1, -1, -1, 8, -1, -1, -1);
    const auto shuffle3 = _mm_set_epi8(15, -1, -1, -1, 14, -1, -1, -1, 13, -1, -1, -1, 12, -1, -1, -1);

    const size_t nVec = (n / 16) * 16;
    for (size_t i = 0; i < nVec; i += 16) {
        const auto chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + i));
        const auto mask = _mm_cmplt_epi8(chars, thresh);
        const auto floatMask0 = _mm_shuffle_epi8(mask, shuffle0);
        const auto floatMask1 = _mm_shuffle_epi8(mask, shuffle1);
        const auto floatMask2 = _mm_shuffle_epi8(mask, shuffle2);
        const auto floatMask3 = _mm_shuffle_epi8(mask, shuffle3);
        const auto floatMask01 = _mm256_set_m128i(floatMask1, floatMask0);
        const auto floatMask23 = _mm256_set_m128i(floatMask3, floatMask2);
        const auto floats0 = _mm256_blendv_ps(ones, zeros, _mm256_castsi256_ps(floatMask01));
        const auto floats1 = _mm256_blendv_ps(ones, zeros, _mm256_castsi256_ps(floatMask23));
        _mm256_storeu_ps(f + i, floats0);
        _mm256_storeu_ps(f + i + 8, floats1);
    }
    floatSelect(f + nVec, c + nVec, n % 16, c_thresh);
}
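
For what it's worth, a hypothetical usage sketch (the buffer size and threshold here are made up):

#include <vector>

int main()
{
    std::vector<char>  c(1000, 5);   // every value is below the threshold
    std::vector<float> f(1000);
    vecFloatSelect(f.data(), c.data(), c.size(), 10);
    // every f[i] is now 0.0f, because 5 < 10
}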

Converting to

f[i] = (float)(c[i] >= c_thresh);

will also auto-vectorize with the Intel Compiler (as others have mentioned is true for GCC as well).

If you need to auto-vectorize some branchy loop in general, you could also try #pragma ivdep or #pragma simd (the latter is part of the Intel Cilk Plus and OpenMP 4.0 standards). These pragmas make the compiler auto-vectorize the given code in a portable way for SSE, AVX and future vector extensions (like AVX-512). They are supported by the Intel Compiler (all known versions), the Cray and PGI compilers (ivdep only), probably the upcoming GCC 4.9 release, and partially by MSVC (ivdep only) starting from VS2012.

For the given example I didn't change anything (I kept the if and char*); I just added #pragma ivdep:

void foo(float *f, char*c, char c_thresh, const int n) {
    #pragma ivdep
    for ( int i = 0; i < n; ++i )
    {
        if ( c[i] < c_thresh ) f[i] = 0.0f;
        else                   f[i] = 1.0f;
    }
}

On my Core i5 with no AVX support (SSE3 only), for n = 32M (32,000,000), with c[i] generated randomly and c_thresh equal to 0 (using signed char), the given code gets roughly a 5x speed-up from vectorization with ICL.

The full test (with an additional correctness check) is available here (it's coliru, i.e. GCC 4.8 only, no ICL/Cray, which is why it doesn't vectorize in the coliru environment).

It should be possible to optimize performance further by dealing with prefetching, alignment, and type-conversion pragmas/optimizations. Also, adding the restrict keyword (or __restrict__, depending on the compiler used) could be used instead of ivdep/simd for this simple case, while for more general cases the simd/ivdep pragmas are the most powerful.
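
For reference, the restrict variant mentioned above might look like this (a sketch; __restrict__ is the GCC/Clang/ICC spelling, MSVC uses __restrict):

void foo_restrict(float* __restrict__ f, const char* __restrict__ c,
                  char c_thresh, const int n)
{
    // restrict promises f and c don't alias, which is the assumption
    // ivdep/simd would otherwise have to assert for this loop
    for (int i = 0; i < n; ++i)
    {
        if (c[i] < c_thresh) f[i] = 0.0f;
        else                 f[i] = 1.0f;
    }
}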

Note: in fact, #pragma ivdep "instructs the compiler to ignore assumed cross-iteration dependencies" (roughly speaking, the kind that would lead to data races if you parallelized the same loop). Compilers are very conservative about these assumptions for well-known reasons. In the given case there are obviously no write-after-read or read-after-write dependencies. If needed, one could validate the presence of such dependencies, at least on a given workload, with dynamic tools like Advisor XE Correctness analysis, as shown in my comments below.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow