Why is this slower than memcmp

Question 1

I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".

If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.

The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.

You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.

Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:

 __m128i x = _mm_load_si128((__m128i*)(a));
 __m128i y = _mm_load_si128((__m128i*)(b));

 for(int i = 0; i < count; i+=2)
 {
    __m128i cmp = _mm_cmpeq_epi32(x, y);

    __m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
    __m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));

    if (_mm_movemask_epi8(cmp) != 0xffff) return i; 
    cmp = _mm_cmpeq_epi32(x1, y1);
    __m128i x = _mm_load_si128((__m128i*)(a + i + 2));
    __m128i y = _mm_load_si128((__m128i*)(b + i + 2));
    if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1; 
}

Roughly something like that.

Question 2

You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function, it starts with some sanity checks and then checks if the pointers are aligned or not:

aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );

If they are not aligned it compares the pointers byte by byte until the start of an aligned address:

 while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
   if(*a++ != *b++) return -1;
   --len;
}

And then compares the remaining memory with SSE instructions similar to your code:

 if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....

Question 3

I cannot help you directly because I'm using Mac, but there's an easy way to figure out what happens:

You just step into memcpy in the debug mode and switch to Disassembly view. As the memcpy is a simple little function, you will easily figure out all the implementation tricks.