What's happening here is that you're being bitten by lazy allocation of virtual memory. If you change your code so that each benchmark runs twice, like this:
// 128-bit SIMD Bitwise AND
memcpy(dest, key1, minlen);
start_time = microtime();
simd_bitwise_and(dest, key2, minlen);
end_time = microtime();
printf("SIMD Elapsed : %8.6fs\n", (end_time - start_time));
assert(0x03 == dest[128]);
// 4xWORD Bitwise AND
memcpy(dest, key1, minlen);
start_time = microtime();
word_bitwise_and(dest, key2, minlen);
end_time = microtime();
printf("Scalar Elapsed: %8.6fs\n", (end_time - start_time));
assert(0x03 == dest[128]);
// 128-bit SIMD Bitwise AND
memcpy(dest, key1, minlen);
start_time = microtime();
simd_bitwise_and(dest, key2, minlen);
end_time = microtime();
printf("SIMD Elapsed : %8.6fs\n", (end_time - start_time));
assert(0x03 == dest[128]);
// 4xWORD Bitwise AND
memcpy(dest, key1, minlen);
start_time = microtime();
word_bitwise_and(dest, key2, minlen);
end_time = microtime();
printf("Scalar Elapsed: %8.6fs\n", (end_time - start_time));
assert(0x03 == dest[128]);
you should see results something like this:
$ ./bitwise-and
SIMD Elapsed : 630061.000000s
Scalar Elapsed: 228156.000000s
SIMD Elapsed : 182645.000000s
Scalar Elapsed: 202697.000000s
$
Explanation: the first time you iterate through your large memory allocations you generate page faults, because the OS only backs each page with physical memory when it is first touched. This gives an artificially high time for the first benchmark, which happens to be the SIMD benchmark. By the second and subsequent benchmarks the pages are all mapped in, so you get a more accurate measurement and, as expected, the SIMD routine is slightly faster than the scalar routine. The difference is not as large as might be expected because you are executing only one ALU instruction for every 2 loads + 1 store, so performance is limited by DRAM bandwidth rather than computational efficiency.
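If you want to see the first-touch cost in isolation, a minimal sketch along these lines should show it on most systems. The helper names (`now_seconds`, `touch_pages`, `time_first_touch`) are made up for illustration and are not part of your program, and the sketch assumes a POSIX system with `clock_gettime`. The page size is passed in as a parameter rather than queried.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: seconds from a monotonic clock (POSIX assumed). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Write one byte at every page_size-byte stride of buf.
   Returns the number of bytes written. */
static size_t touch_pages(volatile unsigned char *buf, size_t len,
                          size_t page_size)
{
    size_t touched = 0;
    assert(page_size > 0);
    for (size_t i = 0; i < len; i += page_size) {
        buf[i] = 1;
        touched++;
    }
    return touched;
}

/* Time two consecutive passes over a freshly malloc'd buffer.
   On a lazily allocating OS the first pass typically pays for the
   page faults and the second does not. Returns 0 on success, -1 if
   the allocation fails. */
static int time_first_touch(size_t len, size_t page_size,
                            double *first, double *second)
{
    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return -1;

    double t0 = now_seconds();
    touch_pages(buf, len, page_size);   /* pages get faulted in here */
    double t1 = now_seconds();
    touch_pages(buf, len, page_size);   /* pages are already mapped */
    double t2 = now_seconds();

    *first = t1 - t0;
    *second = t2 - t1;
    free(buf);
    return 0;
}
```

Calling `time_first_touch` with a buffer of a few hundred megabytes and a page size of 4096 and printing the two times should make the gap visible. How large it is depends on the OS and the allocator, so treat the numbers as indicative only.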
As a general rule when writing benchmarking code: always call the benchmark routine at least once before taking any timing measurements, so that all the memory it touches has already been faulted in. After that, run the benchmark routine a number of times in a loop and ignore any outliers.
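One way to package that rule is a small harness like the sketch below. It is not taken from your code: `bench_fn`, `bench_median`, `now_seconds` and the placeholder workload `count_call` are names invented here, and taking the median is just one simple way of ignoring outliers (the minimum is another common choice). It assumes a POSIX system with `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical callback type: the routine under test, with a context. */
typedef void (*bench_fn)(void *ctx);

/* Hypothetical helper: seconds from a monotonic clock (POSIX assumed). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Call fn once untimed as a warm-up, then time `runs` further calls
   and return the median elapsed time in seconds.
   Returns -1.0 if the sample buffer cannot be allocated. */
static double bench_median(bench_fn fn, void *ctx, size_t runs)
{
    assert(fn != NULL && runs > 0);

    double *samples = malloc(runs * sizeof *samples);
    if (samples == NULL)
        return -1.0;

    fn(ctx);                            /* warm-up: faults pages in */

    for (size_t i = 0; i < runs; i++) {
        double t0 = now_seconds();
        fn(ctx);
        samples[i] = now_seconds() - t0;
    }

    qsort(samples, runs, sizeof *samples, cmp_double);
    double median = samples[runs / 2];
    free(samples);
    return median;
}

/* Placeholder workload for illustration: counts how often it runs. */
static void count_call(void *ctx)
{
    ++*(size_t *)ctx;
}
```

In your program you would replace `count_call` with a small wrapper that calls `simd_bitwise_and(dest, key2, minlen)` or `word_bitwise_and(dest, key2, minlen)`, passing the buffers through `ctx`. Because the warm-up call touches every page first, the order in which the two routines are benchmarked should no longer matter.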