It looks to me like it's a cache aliasing issue.
The test case is quite clever: it correctly loads everything into cache before timing it, and everything does appear to fit. I've verified this (in simulation) with valgrind's cachegrind tool, and as you'd expect for such a small test case, there are no significant cache misses:
valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 /tmp/so
==11130== Cachegrind, a cache and branch-prediction profiler
==11130== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==11130== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11130== Command: /tmp/so
==11130==
--11130-- warning: L3 cache found, using its data for the LL simulation.
a: 6.692648s wall, 6.670000s user + 0.000000s system = 6.670000s CPU (99.7%)
b: 7.306552s wall, 7.280000s user + 0.000000s system = 7.280000s CPU (99.6%)
==11130==
==11130== I refs: 2,484,996,374
==11130== I1 misses: 1,843
==11130== LLi misses: 1,694
==11130== I1 miss rate: 0.00%
==11130== LLi miss rate: 0.00%
==11130==
==11130== D refs: 537,530,151 (470,253,428 rd + 67,276,723 wr)
==11130== D1 misses: 14,477 ( 12,433 rd + 2,044 wr)
==11130== LLd misses: 8,336 ( 6,817 rd + 1,519 wr)
==11130== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==11130== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==11130==
==11130== LL refs: 16,320 ( 14,276 rd + 2,044 wr)
==11130== LL misses: 10,030 ( 8,511 rd + 1,519 wr)
==11130== LL miss rate: 0.0% ( 0.0% + 0.0% )
I picked a 32k, 8-way set-associative cache with 64-byte lines to match common Intel CPUs, and repeatedly saw the same discrepancy between the a and b functions.
Running on an imaginary machine with a 32k, 128-way set-associative cache with the same line size, though, the difference all but disappears:
valgrind --tool=cachegrind --I1=32768,128,64 --D1=32768,128,64 /tmp/so
==11135== Cachegrind, a cache and branch-prediction profiler
==11135== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==11135== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11135== Command: /tmp/so
==11135==
--11135-- warning: L3 cache found, using its data for the LL simulation.
a: 6.754838s wall, 6.730000s user + 0.010000s system = 6.740000s CPU (99.8%)
b: 6.827246s wall, 6.800000s user + 0.000000s system = 6.800000s CPU (99.6%)
==11135==
==11135== I refs: 2,484,996,642
==11135== I1 misses: 1,816
==11135== LLi misses: 1,718
==11135== I1 miss rate: 0.00%
==11135== LLi miss rate: 0.00%
==11135==
==11135== D refs: 537,530,207 (470,253,470 rd + 67,276,737 wr)
==11135== D1 misses: 14,297 ( 12,276 rd + 2,021 wr)
==11135== LLd misses: 8,336 ( 6,817 rd + 1,519 wr)
==11135== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==11135== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==11135==
==11135== LL refs: 16,113 ( 14,092 rd + 2,021 wr)
==11135== LL misses: 10,054 ( 8,535 rd + 1,519 wr)
==11135== LL miss rate: 0.0% ( 0.0% + 0.0% )
In an 8-way cache there are fewer slots in each set where potentially aliasing code can hide, so you get the addressing equivalent of more hash collisions. On the machine with higher associativity, you get lucky with where things land in the object file: there's still no cache miss, but you also don't have to do any extra work to resolve which cache line you actually need.
Edit: more on cache associativity: http://en.wikipedia.org/wiki/CPU_cache#Associativity
Another edit: I've confirmed this with hardware event monitoring through the perf tool.
I modified the source to call only a() or b() depending on whether there was a command line argument present. The timings are the same as in the original test case.
sudo perf record -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses /tmp/so
a: 6.317755s wall, 6.300000s user + 0.000000s system = 6.300000s CPU (99.7%)
sudo perf report
4K dTLB-loads
97 dTLB-load-misses
4K dTLB-stores
7 dTLB-store-misses
479 iTLB-loads
142 iTLB-load-misses
whereas
sudo perf record -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses /tmp/so foobar
b: 4.854249s wall, 4.840000s user + 0.000000s system = 4.840000s CPU (99.7%)
sudo perf report
3K dTLB-loads
87 dTLB-load-misses
3K dTLB-stores
19 dTLB-store-misses
259 iTLB-loads
93 iTLB-load-misses
This shows that b triggers less TLB activity, so its cache lines don't have to be evicted. Given that the two functions are otherwise functionally identical, the difference can only be explained by aliasing.