Question

I am looking to optimize this piece of code. BitSetPattern is 512 bits long, while BitSetOut is 2048 bits.

Basically, this code takes four BitSetPatterns and creates a BitSetOut by concatenating them. Copying bit by bit takes too much time.

I figured an OR along with a SHIFT would solve the case, but the operators only take bitsets of the same length. There is also no way to initialize a bitset with a smaller one and fill the remaining bits with zeroes.

unsigned outputIter = 0;
BitSetOut output(0);

for (int i = 3; i >= 0; i--)
{
    BitSetPattern currentInput = this->input[ i ]->getOutput();
    for (unsigned j = 0; j < currentInput.size(); j++)
    {
        output[ outputIter ] = currentInput[ j ];
        outputIter++;
    }
}
return output;

I have already profiled the code, and this is the method that takes too much time.


Solution

Here's a very UNSAFE solution that you should NOT use:

assert (sizeof(output) == output.size() / 8);
assert ((output.size() % 8) == 0);

int offset = 0;
for (int i = 3; i >= 0; --i)
{
    auto const sz = input[i].size();

    assert ((sz % 8) == 0);
    assert (offset + (sz / 8) <= output.size() / 8);

    memcpy (((char*)&(output)) + offset, (char*)&(input[i]), sz / 8);
    offset += sz / 8;
}

Basically, this tries to be safe by making sure there is nothing inside the bitset except the bits themselves (no padding, alignment gaps, a stored size, or compressed bits) and then copies them wholesale.

There is nothing in the standard (AFAIK) that guarantees that this will work. It may not even work for existing implementations, but I believe that it "should" work for a straightforward std::bitset implementation.
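If you want a fully portable fallback (e.g. to verify the memcpy path in a debug build), std::bitset can be widened through its string representation: the string constructor zero-fills the bits the string does not cover, which is exactly the "initialize a bitset with a smaller one" operation the question says is missing. A sketch, assuming BitSetPattern and BitSetOut are std::bitset<512> and std::bitset<2048>:

```cpp
#include <bitset>
#include <cassert>
#include <string>

using BitSetPattern = std::bitset<512>;   // assumed typedefs
using BitSetOut     = std::bitset<2048>;

// Concatenate four 512-bit patterns into one 2048-bit set, with
// parts[3] occupying the lowest 512 bits (matching the original
// loop, which copies i = 3 first).
BitSetOut concat(const BitSetPattern parts[4])
{
    BitSetOut out;
    for (int i = 3; i >= 0; --i)
    {
        // Widen 512 -> 2048 bits: the string constructor fills the
        // low bits from the string and zeroes the rest.
        BitSetOut widened(parts[i].to_string());
        out |= widened << ((3 - i) * 512);
    }
    return out;
}
```

This uses only standard-guaranteed behaviour, at the cost of a heap-allocated string per part, so treat it as a correctness reference rather than a fast path.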

It might be possible to speed up the copy part even more. Since you know the size of your data, and it is very small, you can directly write SSE or even AVX intrinsics that will copy those 512 bits from the source address to the destination address.
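For example, on x86-64 a 512-bit (64-byte) block can be copied with four unaligned SSE2 loads and stores. This is only a sketch; the function name is made up, and in practice a good compiler will emit much the same code for a constant-size memcpy anyway:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, baseline on every x86-64 target
#include <cassert>

// Copy exactly 64 bytes (512 bits) from src to dst.
// Unaligned load/store, so neither pointer needs special alignment.
static void copy512(void* dst, const void* src)
{
    const __m128i* s = static_cast<const __m128i*>(src);
    __m128i*       d = static_cast<__m128i*>(dst);
    _mm_storeu_si128(d + 0, _mm_loadu_si128(s + 0));
    _mm_storeu_si128(d + 1, _mm_loadu_si128(s + 1));
    _mm_storeu_si128(d + 2, _mm_loadu_si128(s + 2));
    _mm_storeu_si128(d + 3, _mm_loadu_si128(s + 3));
}
```

With AVX this shrinks to two 256-bit loads/stores, but unlike AVX, SSE2 is guaranteed to be available on any x86-64 machine.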

Here are three more things to try:

  1. If you are absolutely sure that your sizes remain constant (e.g. at 2048 and 512 bits), use the constant values and drop the arithmetic and assertions. This might help depending on how your compiler treats memcpy (some compilers optimize the hell out of it under certain circumstances, e.g. if the sizes are constant and a multiple of the word size, etc.)

  2. Make sure your bit buffers are allocated at addresses that are multiples of the cache line size (e.g. 64 bytes). This ensures you don't touch more cache lines than necessary.

  3. You can try and help the memory "prefetcher" by touching the next input buffer on each iteration. For example:

    char * output_ptr = (char *)&output;
    char * input_ptrs [4] = {(char*)&(input[0]), (char*)&(input[1]), (char*)&(input[2]), (char*)&(input[3])};
    volatile char dummy = 0;
    
    dummy += input_ptrs[2][0];                    // prefetch the next one
    memcpy (output_ptr +   0, input_ptrs[3], 64); // copy
    
    dummy += input_ptrs[1][0];                    // prefetch the next one
    memcpy (output_ptr +  64, input_ptrs[2], 64); // copy
    
    dummy += input_ptrs[0][0];                    // prefetch the next one
    memcpy (output_ptr + 128, input_ptrs[1], 64); // copy
    
    memcpy (output_ptr + 192, input_ptrs[0], 64); // copy
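Points 1 and 2 together might look like the sketch below. The alignas(64) wrapper types and the function name are assumptions for illustration, not part of the original code, and the static_asserts encode the same "nothing but the bits" layout assumption as the memcpy answer above:

```cpp
#include <bitset>
#include <cassert>
#include <cstring>

constexpr std::size_t kPatternBytes = 512 / 8;   // 64 bytes, one cache line

// Cache-line-aligned wrappers (hypothetical names) so each 512-bit
// pattern starts on its own 64-byte cache line.
struct alignas(64) AlignedPattern { std::bitset<512>  bits; };
struct alignas(64) AlignedOut     { std::bitset<2048> bits; };

// Same unchecked layout assumption as before, but caught at compile time.
static_assert(sizeof(std::bitset<512>)  == 512 / 8,  "bitset has extra state");
static_assert(sizeof(std::bitset<2048>) == 2048 / 8, "bitset has extra state");

void concat_fixed(AlignedOut& out, const AlignedPattern in[4])
{
    // Constant sizes and offsets: the compiler can turn each memcpy
    // into a handful of vector moves instead of a generic byte loop.
    char* dst = reinterpret_cast<char*>(&out.bits);
    std::memcpy(dst + 0 * kPatternBytes, &in[3].bits, kPatternBytes);
    std::memcpy(dst + 1 * kPatternBytes, &in[2].bits, kPatternBytes);
    std::memcpy(dst + 2 * kPatternBytes, &in[1].bits, kPatternBytes);
    std::memcpy(dst + 3 * kPatternBytes, &in[0].bits, kPatternBytes);
}
```

If you would rather express point 3 as an explicit hint instead of the volatile-read trick, GCC and Clang also provide __builtin_prefetch(ptr) for this purpose.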
    
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow