Optimize check for a bit-vector being a proper subset of another?

Question 1

There are three possibilities I see here.

First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr within the first few blocks, the plain C++ version will almost always be faster. In that case, your SSE code runs through more code and data to reach the same answer.
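To make the first point concrete, here is a minimal sketch of what such a plain C++ loop might look like. The function name and signature are hypothetical (your original code is not reproduced here); the point is only that the early return fires as soon as one word fails the subset test.

```cpp
#include <cstddef>

// Hypothetical scalar baseline: returns true when every bit set in t is also
// set in x and the two vectors differ somewhere (t is a proper subset of x).
bool is_proper_subset_scalar(const unsigned* t, const unsigned* x, std::size_t n)
{
    unsigned diff = 0;                 // accumulates bits where t and x differ
    for (std::size_t i = 0; i != n; ++i)
    {
        if ((t[i] & x[i]) != t[i])     // t has a bit that x lacks: not a subset
            return false;              // early exit, cheap when it happens early
        diff |= t[i] ^ x[i];
    }
    return diff != 0;                  // identical vectors are not a proper subset
}
```

If most inputs fail within the first word or two, this loop touches only a handful of bytes before returning, which is hard for any wider version to beat.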

Second, your SSE code may be incorrect. It's not totally clear from what you posted. If no_blocks_ is identical between the two samples, then start + i is probably indexing in units of 128-bit elements rather than the 32-bit elements of the first sample, which would walk four times as far through memory.
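The indexing concern comes down to ordinary pointer arithmetic: p + i advances by i * sizeof(*p) bytes. A small sketch, using a hypothetical 16-byte struct Block128 as a stand-in for __m128i so it compiles without the intrinsics header:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for __m128i: 16 bytes, 16-byte aligned.
struct alignas(16) Block128 { std::uint32_t w[4]; };

// Number of bytes that `p + i` moves past `p`.
// The caller must keep p + i inside (or one past the end of) the same array.
template <typename T>
std::ptrdiff_t byte_offset(const T* p, std::size_t i)
{
    return reinterpret_cast<const char*>(p + i) -
           reinterpret_cast<const char*>(p);
}
```

With a std::uint32_t pointer, byte_offset(p, 3) is 12; with a Block128 pointer it is 48. So if the same loop bound is used with both pointer types, the 128-bit version covers four times the memory.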

Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.

Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int to 0.25.

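#include <emmintrin.h>  // SSE2 intrinsics

// Despite the name, returns true when a is a proper subset of b.
// Assumes a and b are 16-byte aligned and count is a multiple of 8.
bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
    // Accumulate a ^ b; these stay all-zero only if a and b are identical.
    __m128i eq1 = _mm_setzero_si128();
    __m128i eq2 = _mm_setzero_si128();

    for (unsigned i = 0; i != count; i += 8)
    {
        __m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
        __m128i xb1 = _mm_load_si128((__m128i const*)(b + i));

        eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
        // Each 32-bit lane becomes all-ones where (a & b) == a.
        xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));

        __m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
        __m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));

        eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
        xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));

        // One data-dependent branch per 8 ints: fail if any lane is not a subset.
        if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
            return false;
    }

    // _mm_movemask_epi8 only reads the top bit of each byte, so compare the
    // accumulated difference against zero before taking the mask.
    __m128i same = _mm_cmpeq_epi32(_mm_or_si128(eq1, eq2), _mm_setzero_si128());
    return _mm_movemask_epi8(same) != 0xFFFF;
}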

If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.

Question 2

It seems that your problem is memory-bandwidth bound: asymptotically, you need only about 2 operations per pair of integers scanned from memory. There is not enough arithmetic work to benefit from the extra arithmetic throughput of the CPU's SSE instructions. In fact, your CPU spends a lot of time waiting for data transfers. Meanwhile, using SSE instructions in your case adds instruction overhead, and the resulting code is not optimized well by the compiler.

There are some alternative strategies to improve performance in a bandwidth-bound problem:

Use multiple threads: concurrent arithmetic operations can hide memory access latency, for example in a hyper-threading context.
Fine-tune the size of the data loaded at a time to improve memory bandwidth.
Improve pipeline continuity by adding independent operations to the loop (scan two different sets of data at each step of your "for" loop).
Keep more data in cache or in registers (some iterations of your code may need the same set of data many times).
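As an illustration of the third strategy, here is a rough scalar sketch that handles two words per iteration with independent accumulators, so neither chain has to wait on the other. The function name is hypothetical, n is assumed to be a multiple of 2, and whether this actually helps depends on your data and compiler, so measure before adopting it.

```cpp
#include <cstddef>

// Hypothetical sketch: proper-subset test with two independent dependency
// chains per iteration. Assumes n is a multiple of 2.
bool is_proper_subset_unrolled(const unsigned* t, const unsigned* x, std::size_t n)
{
    unsigned miss0 = 0, miss1 = 0;   // bits set in t but missing from x
    unsigned diff0 = 0, diff1 = 0;   // bits where t and x differ
    for (std::size_t i = 0; i != n; i += 2)
    {
        // Chain 0 and chain 1 share no data, so the CPU can overlap them.
        miss0 |= t[i]     & ~x[i];
        diff0 |= t[i]     ^  x[i];
        miss1 |= t[i + 1] & ~x[i + 1];
        diff1 |= t[i + 1] ^  x[i + 1];
    }
    // Subset: nothing missing. Proper: at least one differing bit.
    return (miss0 | miss1) == 0 && (diff0 | diff1) != 0;
}
```

Note the trade-off: this version has no early exit, so it always scans the whole input. That suits data where failures are rare; when failures tend to occur early, a loop that returns on the first mismatch will likely do less work.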