Question

I am looking to optimize this piece of code. BitSetPattern is 512 bits long, while BitSetOut is 2048 bits.

Basically, this code takes four BitSetPatterns and creates a BitSetOut by concatenating them. Copying bit by bit takes too much time.

I figured an OR along with a SHIFT would solve the case, but the operators only take bitsets of the same length. There is also no way to initialize a bitset with a smaller one and fill the remaining bits with zeroes.

unsigned outputIter = 0;
BitSetOut output(0);

for (int i = 3; i >= 0; i--)
{
    BitSetPattern currentInput = this->input[ i ]->getOutput();
    for (unsigned j = 0; j < currentInput.size(); j++)
    {
        output[ outputIter ] = currentInput[ j ];
        outputIter++;
    }
}
return output;

I have already profiled the code, and this is the method that takes too much time.


Solution

Here's a very UNSAFE solution that you should NOT use:

assert (sizeof(output) == output.size() / 8);
assert ((output.size() % 8) == 0);

int offset = 0;
for (int i = 3; i >= 0; --i)
{
    auto const sz = input[i].size();

    assert ((sz % 8) == 0);
    assert (offset + (sz / 8) <= output.size() / 8);

    memcpy (((char*)&(output)) + offset, (char*)&(input[i]), sz / 8);
    offset += sz / 8;
}

Basically, this tries to be safe by making sure there is nothing inside the bitset except the bits themselves (no padding, alignment gaps, a stored size, or compressed bits) and then copies them wholesale.

There is nothing in the standard (AFAIK) that guarantees that this will work. It may not even work for existing implementations, but I believe that it "should" work for a straightforward std::bitset implementation.
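If you want a fully portable fallback (e.g. to verify the memcpy path in a debug build), std::bitset can be widened through its string representation: the string constructor zero-fills the bits the string does not cover, which is exactly the "initialize a bitset with a smaller one" operation the question says is missing. A sketch, assuming BitSetPattern and BitSetOut are std::bitset<512> and std::bitset<2048>:

```cpp
#include <bitset>
#include <cassert>
#include <string>

using BitSetPattern = std::bitset<512>;   // assumed typedefs
using BitSetOut     = std::bitset<2048>;

// Concatenate four 512-bit patterns into one 2048-bit set, with
// parts[3] occupying the lowest 512 bits (matching the original
// loop, which copies i = 3 first).
BitSetOut concat(const BitSetPattern parts[4])
{
    BitSetOut out;
    for (int i = 3; i >= 0; --i)
    {
        // Widen 512 -> 2048 bits: the string constructor fills the
        // low bits from the string and zeroes the rest.
        BitSetOut widened(parts[i].to_string());
        out |= widened << ((3 - i) * 512);
    }
    return out;
}
```

This uses only standard-guaranteed behaviour, at the cost of a heap-allocated string per part, so treat it as a correctness reference rather than a fast path.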

It might be possible to speed up the copy part even more. Since you know the size of your data, and it is very small, you can directly write SSE or even AVX intrinsics that will copy those 512 bits from the source address to the destination address.
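For example, on x86-64 a 512-bit (64-byte) block can be copied with four unaligned SSE2 loads and stores. This is only a sketch; the function name is made up, and in practice a good compiler will emit much the same code for a constant-size memcpy anyway:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, baseline on every x86-64 target
#include <cassert>

// Copy exactly 64 bytes (512 bits) from src to dst.
// Unaligned load/store, so neither pointer needs special alignment.
static void copy512(void* dst, const void* src)
{
    const __m128i* s = static_cast<const __m128i*>(src);
    __m128i*       d = static_cast<__m128i*>(dst);
    _mm_storeu_si128(d + 0, _mm_loadu_si128(s + 0));
    _mm_storeu_si128(d + 1, _mm_loadu_si128(s + 1));
    _mm_storeu_si128(d + 2, _mm_loadu_si128(s + 2));
    _mm_storeu_si128(d + 3, _mm_loadu_si128(s + 3));
}
```

With AVX this shrinks to two 256-bit loads/stores, but unlike AVX, SSE2 is guaranteed to be available on any x86-64 machine.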

Here are three more things to try:

  1. If you are absolutely sure that your sizes remain constant (e.g. at 2048 and 512 bits), use the constant values and drop the arithmetic and assertions. This might help depending on how your compiler treats memcpy (some compilers optimize the hell out of it under certain circumstances, e.g. if the sizes are constant and a multiple of the word size, etc.)

  2. Make sure your bit buffers are allocated at addresses that are multiples of the cache line size (e.g. 64 bytes). This ensures you don't touch more cache lines than necessary.

  3. You can try and help the memory "prefetcher" by touching the next input buffer on each iteration. For example:

    char * output_ptr = (char *)&output;
    char * input_ptrs [4] = {(char*)&(input[0]), (char*)&(input[1]), (char*)&(input[2]), (char*)&(input[3])};
    volatile char dummy = 0;
    
    dummy += input_ptrs[2][0];                    // prefetch the next one
    memcpy (output_ptr +   0, input_ptrs[3], 64); // copy
    
    dummy += input_ptrs[1][0];                    // prefetch the next one
    memcpy (output_ptr +  64, input_ptrs[2], 64); // copy
    
    dummy += input_ptrs[0][0];                    // prefetch the next one
    memcpy (output_ptr + 128, input_ptrs[1], 64); // copy
    
    memcpy (output_ptr + 192, input_ptrs[0], 64); // copy
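Points 1 and 2 together might look like the sketch below. The alignas(64) wrapper types and the function name are assumptions for illustration, not part of the original code, and the static_asserts encode the same "nothing but the bits" layout assumption as the memcpy answer above:

```cpp
#include <bitset>
#include <cassert>
#include <cstring>

constexpr std::size_t kPatternBytes = 512 / 8;   // 64 bytes, one cache line

// Cache-line-aligned wrappers (hypothetical names) so each 512-bit
// pattern starts on its own 64-byte cache line.
struct alignas(64) AlignedPattern { std::bitset<512>  bits; };
struct alignas(64) AlignedOut     { std::bitset<2048> bits; };

// Same unchecked layout assumption as before, but caught at compile time.
static_assert(sizeof(std::bitset<512>)  == 512 / 8,  "bitset has extra state");
static_assert(sizeof(std::bitset<2048>) == 2048 / 8, "bitset has extra state");

void concat_fixed(AlignedOut& out, const AlignedPattern in[4])
{
    // Constant sizes and offsets: the compiler can turn each memcpy
    // into a handful of vector moves instead of a generic byte loop.
    char* dst = reinterpret_cast<char*>(&out.bits);
    std::memcpy(dst + 0 * kPatternBytes, &in[3].bits, kPatternBytes);
    std::memcpy(dst + 1 * kPatternBytes, &in[2].bits, kPatternBytes);
    std::memcpy(dst + 2 * kPatternBytes, &in[1].bits, kPatternBytes);
    std::memcpy(dst + 3 * kPatternBytes, &in[0].bits, kPatternBytes);
}
```

If you would rather express point 3 as an explicit hint instead of the volatile-read trick, GCC and Clang also provide __builtin_prefetch(ptr) for this purpose.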
    
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow