Question

I would like some help optimizing the most computationally intensive function of my program. Currently, I am finding that the basic (non-SSE) version is significantly faster (up to 3x). I would thus request your help in rectifying this.

The function looks for subsets in unsigned integer vectors, and reports if they exist or not. For your convenience I have included the relevant code snippets only.

First up is the basic variant. It checks to see if blocks_ is a proper subset of x.blocks_. (Not exactly equal.) These are bitmaps, aka bit vectors or bitsets.

    //Check for self comparison
    if (this == &x)
            return false;

    //A subset is equal to or smaller.
    if (no_bits_ > x.no_bits_)
            return false;

    int i;

    //Becomes true once a differing block is seen, so only proper (strict) subsets are reported.
    bool equal = false;

    //Pointers should not change.
    const unsigned int *tptr = blocks_;
    const unsigned int *xptr = x.blocks_;

    for (i = 0; i < no_blocks_; i++, tptr++, xptr++) {
            if ((*tptr & *xptr) != *tptr)
                    return false;
            if (*tptr != *xptr)
                    equal = true;
    }

    return equal;

Then comes the SSE variant, which alas does not perform according to my expectations. Both snippets should be checking for the same thing.

    //starting pointers.        
    const __m128i* start = (__m128i*)&blocks_;
    const __m128i* xstart = (__m128i*)&x.blocks_;
    
    __m128i block;
    __m128i xblock;

    //Unsigned ints are 32 bits, meaning 4 can fit in a register.
    for (i = 0; i < no_blocks_; i += 4) {

            block = _mm_load_si128(start + i);
            xblock = _mm_load_si128(xstart + i);

            //Equivalent to (block & xblock) != block
            if (_mm_movemask_epi8(_mm_cmpeq_epi32(_mm_and_si128(block, xblock), block)) != 0xffff)
                    return false;

            //Equivalent to block != xblock
            if (_mm_movemask_epi8(_mm_cmpeq_epi32(block, xblock)) != 0xffff)
                    equal = true;
    }
    return equal;

Do you have any suggestions as to how I may improve upon the performance of the SSE version? Am I doing something wrong? Or is this a case where optimization should be done elsewhere?

I have not yet added in the leftover calculations for no_blocks_ % 4 != 0, but there is little purpose in doing so until the performance increases, and it would only clutter up the code at this point.


Solution

There are three possibilities I see here.

First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr for one of the first few blocks, the plain C++ version will almost certainly be faster, because it bails out early; in that case the SSE version runs through more code and data to accomplish the same thing.

Second, your SSE code may be incorrect; it's hard to tell from the snippet alone. If no_blocks_ counts the same 32-bit blocks in both samples, then start + i has the unwanted behavior of indexing in units of 128-bit elements rather than 32-bit ones as in the first sample, so the loop strides four times too far through memory.
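
For illustration, here is a minimal corrected form of the loads (a sketch only, assuming blocks_ points to the 16-byte-aligned block data, as the first sample's tptr = blocks_ suggests, and that no_blocks_ is a multiple of 4):

    //Sketch: index in units of 32-bit ints and cast the element address
    //for each 128-bit load (note: blocks_ itself, not &blocks_).
    for (i = 0; i < no_blocks_; i += 4) {
            block = _mm_load_si128((const __m128i*)(blocks_ + i));
            xblock = _mm_load_si128((const __m128i*)(x.blocks_ + i));
            //... same movemask tests as before ...
    }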

Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.

Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int to 0.25.

bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
    // Accumulated a != b differences; only tested once, after the loop.
    __m128i eq1 = _mm_setzero_si128();
    __m128i eq2 = _mm_setzero_si128();

    for (unsigned i = 0; i != count; i += 8)
    {
        __m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
        __m128i xb1 = _mm_load_si128((__m128i const*)(b + i));

        // Record any differing bits, then test (a & b) == a per 32-bit lane.
        eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
        xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));

        __m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
        __m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));

        eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
        xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));

        // One branch per 8 ints: fail as soon as any lane is not a subset.
        if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
            return false;
    }

    // Proper subset only if at least one block differed.
    return _mm_movemask_epi8(_mm_or_si128(eq1, eq2)) != 0;
}

If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.
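
For completeness, here is a hypothetical caller (not part of the original answer): _mm_load_si128 requires 16-byte-aligned pointers, and count must be a multiple of 8, with any tail handled separately.

    #include <emmintrin.h>   //SSE2 intrinsics used by equal()

    alignas(16) unsigned a[8] = { 1, 2, 4, 8, 0, 0, 0, 0 };
    alignas(16) unsigned b[8] = { 3, 2, 4, 8, 0, 0, 0, 0 };

    //Every bit set in a is also set in b, and a != b, so a is a proper subset.
    bool proper_subset = equal(a, b, 8);   //expected: true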

OTHER TIPS

It seems that your problem is memory-bandwidth bound: asymptotically, you only need about two operations to process each pair of integers scanned from memory. There is not enough arithmetic per byte loaded to take advantage of the extra arithmetic throughput of the CPU's SSE instructions; in fact the CPU spends most of its time waiting for data transfers. On top of that, using SSE intrinsics in your case adds instruction overhead, and the resulting code is not optimized as well by the compiler.
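
To make that concrete (a back-of-envelope figure derived only from the counts above, not a measurement): each pair of 32-bit integers costs 8 bytes of memory traffic for roughly 2 cheap ALU operations,

    2 ops / 8 bytes = 0.25 operations per byte

which is a very low arithmetic intensity; the loads dominate, not the arithmetic.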

There are some alternative strategies for improving performance in a bandwidth-bound problem:

  • Multi-threading can hide memory-access latency behind concurrent arithmetic, especially in a hyper-threading context (a minimal two-thread sketch follows this list).
  • Fine-tuning how much data is loaded at a time can improve effective memory bandwidth.
  • Improve pipeline continuity by adding independent operations inside the loop (for example, scan two different sets of data in each step of your "for" loop).
  • Keep more data in cache or in registers (some iterations of your code may need the same data many times).
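
As an illustration of the first point, here is a minimal two-thread sketch of the scalar subset check using std::async; the function names and the fixed two-way split are illustrative assumptions, not code from the original answers.

    #include <future>

    struct PartialResult { bool subset_ok; bool any_diff; };

    //Scan one range of blocks: check (a & b) == a and remember whether any block differs.
    static PartialResult check_range(const unsigned *a, const unsigned *b, int n)
    {
        PartialResult r = { true, false };
        for (int i = 0; i < n; ++i) {
            if ((a[i] & b[i]) != a[i]) { r.subset_ok = false; break; }
            if (a[i] != b[i]) r.any_diff = true;
        }
        return r;
    }

    bool is_proper_subset_mt(const unsigned *a, const unsigned *b, int no_blocks)
    {
        int half = no_blocks / 2;
        //The upper half is scanned on another thread while this thread scans the lower half.
        auto upper = std::async(std::launch::async, check_range, a + half, b + half, no_blocks - half);
        PartialResult lower = check_range(a, b, half);
        PartialResult upper_r = upper.get();
        return lower.subset_ok && upper_r.subset_ok && (lower.any_diff || upper_r.any_diff);
    }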
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow