Question

I am trying to compare two rows of pixels.

A pixel is defined as a struct containing 4 float values (RGBA).

The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.

My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128 x = _mm_load_ps((float*)(a + i));
        __m128 y = _mm_load_ps((float*)(b + i));
        __m128 cmp = _mm_cmpeq_ps(x, y);
        if (_mm_movemask_ps(cmp) != 15) return i;
    }
    return -1;
}

I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128i x = _mm_load_si128((__m128i*)(a + i));
        __m128i y = _mm_load_si128((__m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i; 
    }
    return -1;
}

From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is what other tricks does the MS implementation have up it's sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?

Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?

I am compiling with /o2 and all the optimization flags.

Was it helpful?

Solution

I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".

If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.

The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.

You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.

Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:

 __m128i x = _mm_load_si128((__m128i*)(a));
 __m128i y = _mm_load_si128((__m128i*)(b));

 for(int i = 0; i < count; i+=2)
 {
    __m128i cmp = _mm_cmpeq_epi32(x, y);

    __m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
    __m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));

    if (_mm_movemask_epi8(cmp) != 0xffff) return i; 
    cmp = _mm_cmpeq_epi32(x1, y1);
    __m128i x = _mm_load_si128((__m128i*)(a + i + 2));
    __m128i y = _mm_load_si128((__m128i*)(b + i + 2));
    if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1; 
}

Roughly something like that.

OTHER TIPS

You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function, it starts with some sanity checks and then checks if the pointers are aligned or not:

aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );

If they are not aligned it compares the pointers byte by byte until the start of an aligned address:

 while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
   if(*a++ != *b++) return -1;
   --len;
}

And then compares the remaining memory with SSE instructions similar to your code:

 if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....

I cannot help you directly because I'm using Mac, but there's an easy way to figure out what happens:

You just step into memcpy in the debug mode and switch to Disassembly view. As the memcpy is a simple little function, you will easily figure out all the implementation tricks.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top