There are three possibilities I see here.
First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr
within the first few blocks, the plain C++ version will almost certainly always be faster. In that instance, your SSE will run through more code & data to accomplish the same thing.
Second, your SSE code may be incorrect. It's not totally clear here. If no_blocks_
is identical between the two samples, then start + i
is probably having the unwanted behavior of indexing into 128-bit elements, not 32-bit as the first sample.
Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.
Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock
branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int
to 0.25.
bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
__m128i eq1 = _mm_setzero_si128();
__m128i eq2 = _mm_setzero_si128();
for (unsigned i = 0; i != count; i += 8)
{
__m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
__m128i xb1 = _mm_load_si128((__m128i const*)(b + i));
eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));
__m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
__m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));
eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));
if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
return false;
}
return _mm_movemask_epi8(_mm_or_si128(eq1, eq2)) != 0;
}
If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.