Performance improvement for mirroring bit matrix around the diagonal

Question 1

Operating on separate bits is very slow, and creating 2 bit arrays and copy data back and forth creates further overheads

The simplest obvious solution is just extracting the bits and combining them again. You can do it with a loop but since it uses both left and right shift at the same time, you need a function to handle negative shift amount. As a result here I unrolled it for easier understanding and more speed

out[0] = (byte)(((rxData[0] & 0x80) >> 0) | ((rxData[1] & 0x80) >> 1) | ((rxData[2] & 0x80) >> 2) | ((rxData[3] & 0x80) >> 3) |
                ((rxData[4] & 0x80) >> 4) | ((rxData[5] & 0x80) >> 5) | ((rxData[6] & 0x80) >> 6) | ((rxData[7] & 0x80) >> 7));
            
out[1] = (byte)(((rxData[0] & 0x40) << 1) | ((rxData[1] & 0x40) >> 0) | ((rxData[2] & 0x40) >> 1) | ((rxData[3] & 0x40) >> 2) |
                ((rxData[4] & 0x40) >> 3) | ((rxData[5] & 0x40) >> 4) | ((rxData[6] & 0x40) >> 5) | ((rxData[7] & 0x40) >> 6));
            
out[2] = (byte)(((rxData[0] & 0x20) << 2) | ((rxData[1] & 0x20) << 1) | ((rxData[2] & 0x20) >> 0) | ((rxData[3] & 0x20) >> 1) |
                ((rxData[4] & 0x20) >> 2) | ((rxData[5] & 0x20) >> 3) | ((rxData[6] & 0x20) >> 4) | ((rxData[7] & 0x20) >> 5));
            
out[3] = (byte)(((rxData[0] & 0x10) << 3) | ((rxData[1] & 0x10) << 2) | ((rxData[2] & 0x10) << 1) | ((rxData[3] & 0x10) >> 0) |
                ((rxData[4] & 0x10) >> 1) | ((rxData[5] & 0x10) >> 2) | ((rxData[6] & 0x10) >> 3) | ((rxData[7] & 0x10) >> 4));
            
out[4] = (byte)(((rxData[0] & 0x08) << 4) | ((rxData[1] & 0x08) << 3) | ((rxData[2] & 0x08) << 2) | ((rxData[3] & 0x08) << 1) |
                ((rxData[4] & 0x08) >> 0) | ((rxData[5] & 0x08) >> 1) | ((rxData[6] & 0x08) >> 2) | ((rxData[7] & 0x08) >> 3));
            
out[5] = (byte)(((rxData[0] & 0x04) << 5) | ((rxData[1] & 0x04) << 4) | ((rxData[2] & 0x04) << 3) | ((rxData[3] & 0x04) << 2) |
                ((rxData[4] & 0x04) << 1) | ((rxData[5] & 0x04) >> 0) | ((rxData[6] & 0x04) >> 1) | ((rxData[7] & 0x04) >> 2));
            
out[6] = (byte)(((rxData[0] & 0x02) << 6) | ((rxData[1] & 0x02) << 5) | ((rxData[2] & 0x02) << 4) | ((rxData[3] & 0x02) << 3) |
                ((rxData[4] & 0x02) << 2) | ((rxData[5] & 0x02) << 1) | ((rxData[6] & 0x02) >> 0) | ((rxData[7] & 0x02) >> 1));

out[7] = (byte)(((rxData[0] & 0x01) << 7) | ((rxData[1] & 0x01) << 6) | ((rxData[2] & 0x01) << 5) | ((rxData[3] & 0x01) << 4) |
                ((rxData[4] & 0x01) << 3) | ((rxData[5] & 0x01) << 2) | ((rxData[6] & 0x01) << 1) | ((rxData[7] & 0x01) >> 0));

Obviously this is still very slow because it operates byte-by-byte. An optimal solution would operate on multiple bytes at once, for example with SIMD and/or multithreading. Especially since you're doing that for lots of data, .NET SIMD intrinsics would be extremely useful

Question 2

You will probably find that BitVector performs a good deal better than BitArray.

BitVector32 is more efficient than BitArray for Boolean values and small integers that are used internally. A BitArray can grow indefinitely as needed, but it has the memory and performance overhead that a class instance requires. In contrast, a BitVector32 uses only 32 bits.

http://msdn.microsoft.com/en-us/library/system.collections.specialized.bitvector32.aspx

If you initialize an array of BitVector32 and operate on those, it should be faster than operating on BitArray as you do now.

You may also get a performance boost if you use one thread to perform the mirroring and a second thread to perform the analysis of consecutive reads. The Task Parallel Library Dataflow provides a nice framework for that type of solution. You could have one Source Block to acquire the data buffer, one Transform Block to perform the mirroring, and one Target Block to perform the data processing.

Question 3

This is actually the same as the get column in a bitboard problem so it can be solved even more efficiently by considering the byte array as a 64-bit integer

byte get_byte(ulong matrix, uint col)
{
    const ulong column_mask = 0x8080808080808080ull;
    const ulong magic       = 0x2040810204081ull;
    
    ulong column = ((matrix << col) & column_mask) * magic;
    return (byte)(column >> 56);
}

// Actually the below step is not needed. You can read rxData directly into the `ulong`
// variable instead of a bit array. Remember to CHANGE THE ENDIANNESS if necessary
ulong matrix = (rxData[7] << 56) | (rxData[6] << 48) | (rxData[5] << 40) | (rxData[4] << 32)
             | (rxData[3] << 24) | (rxData[2] << 16) | (rxData[1] <<  8) | rxData[0];

for (int i = 0; i < 8; i++)
    data[i] = get_byte(matrix, i);

In newer x86 CPUs you can use the PDEP instruction in the BMI2 instruction set. I'm not sure if there's any corresponding intrinsic in C#. If the intrinsic doesn't exist then you have to use native code like this

data[i] = _pext_u64(matrix, column_mask >> col);

Update:

The instruction has been added to .NET Core 3.0 as ParallelBitDeposit() intrinsic so it's now much easier and faster to do that from C#

ulong matrix = BitConverter.ToUInt64(rxData, 0);
for (int i = 0; i < 8; i++)
    data[i] = Bmi2.X64.ParallelBitDeposit(matrix, 0x8080808080808080UL >> i);

There's also ParallelBitExtract() for the inverse PEXT instruction