How to bitwise operate on memory block (C++)

Question

In general, there is no magical way faster than a for loop. However, you can make it easier for the compiler to optimize the loop by keeping a few things in mind:

Load the largest available integer type into memory at a time. However, you need to be careful if your buffer has a length which does not divide evenly by the size of that integer type.
If possible, operate on multiple values in one loop iteration - this should make vectorization much simpler for the compiler. Again, you need to be careful about the buffer length.
If the loop is to be run many times on short sections of code, use a loop index that counts downwards to zero rather than upwards, and subtract it from the array length - this makes it easier for the CPU's branch predictor to figure out what's going on.
You can use explicit vector extensions provided by the compiler, but this will make your code less portable.
Ultimately, you can write the loop in assembly and use vector instructions provided by your CPU, but this is completely unportable.
[edit] Additionally, you can use OpenMP or a similar API to divide the loop between multiple threads, but this will only cause an improvement if you are performing the operation on a very large amount of memory.

C99 example of xoring memory with a constant byte, assuming long long is 128-bit, the start of the buffer is aligned to 16 bytes, and without considering point 3. Bitwise operations on two memory buffers are very similar.

size_t len = ...;
char *buffer = ...;

size_t const loadd_per_i = 4
size_t iters = len / sizeof(long long) / loads_per_i;

long long *ptr = (long long *) buffer;
long long xorvalue = 0x5e5e5e5e5e5e5e5e5e5e5e5e5e5e5e5eLL;

// run in multiple threads if there are more than 4 MB to xor
#pragma omp parallel for if(iters > 65536)
for (size_t i = 0; i < iters; ++i) {
    size_t j = loads_per_i*i;
    ptr[j  ] ^= xorvalue;
    ptr[j+1] ^= xorvalue;
    ptr[j+2] ^= xorvalue;
    ptr[j+3] ^= xorvalue;
}

// finish long longs which don't align to 4
for (size_t i = iters * loads_per_i; i < len / sizeof(long long); ++i) {
    ptr[i] ^= xorvalue;
}

// finish bytes which don't align to long
for (size_t i = (len / sizeof(long long)) * sizeof(long long); i < len; ++i) {
    buffer[i] ^= xorvalue;
}