The generated unlock code is different. The sequentially consistent (seq_cst) memory order (with g++ 4.9.0) generates:
movb %sil, spinLock(%rip)
mfence
for the unlock. The acquire/release version generates only:
movb %sil, spinLock(%rip)
The lock code is the same. Someone else will have to say something about why it's faster with the fence, but if I had to guess, I would guess that the fence reduces contention on the bus and in the cache-coherence traffic. Sometimes stricter is more orderly, and thus faster.
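For reference, here is a minimal sketch of a spinlock whose unlock ordering can be switched between the two variants. The question's actual lock class isn't shown here, so the `SpinLock` template, its `locked` member, and `try_lock` are my own assumptions, not the original code. On x86-64 I'd expect the only difference between the two instantiations to be the store in `unlock()`, as in the listings above.

```cpp
#include <atomic>

// Hypothetical spinlock (not the original class from the question).
// UnlockOrder selects the memory order used by the store in unlock().
template <std::memory_order UnlockOrder>
class SpinLock {
    std::atomic<bool> locked{false};

public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        // exchange returns the previous value: false means we took the lock.
        return !locked.exchange(true, std::memory_order_acquire);
    }

    // Spin until the lock is acquired.
    void lock() {
        while (!try_lock()) {
            // busy-wait
        }
    }

    // With memory_order_seq_cst, g++ 4.9.0 emitted a store plus mfence here;
    // with memory_order_release, just the plain store.
    void unlock() {
        locked.store(false, UnlockOrder);
    }
};

using SeqCstSpinLock  = SpinLock<std::memory_order_seq_cst>;
using ReleaseSpinLock = SpinLock<std::memory_order_release>;
```

Both aliases behave the same as far as mutual exclusion goes; only the cost of `unlock()` differs.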
ADDENDUM: According to this, mfence costs around 100 cycles. So maybe you are reducing bus contention, because when a thread finishes the loop body, it pauses a bit before trying to reacquire the lock, letting the other thread finish. You could try to do the same thing by putting in a short delay loop after the unlock, though you'd have to make sure that it didn't get optimized out.
ADDENDUM2: The slowdown does seem to come from bus interference/contention caused by looping back to reacquire the lock too quickly. I added a short delay loop after the unlock, like:
spinLock.unlock();
for (int i = 0; i < 5; i++) {
j = i * 3.5 + val;
}
Now, the acquire/release version performs the same as the seq_cst version.
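If you want to try this yourself, the sketch below shows one way to keep the delay loop from being optimized out: write the result to a `volatile` variable. Everything here (`spinLock` as a bare `std::atomic<bool>`, `counter`, `sink`, `worker`, `runContended`) is a hypothetical stand-in for the benchmark in the question, not the original code, and it makes no claim about timings.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> spinLock{false};  // assumed lock word; false means unlocked
long counter = 0;                   // shared data, protected by spinLock
volatile double sink = 0;           // volatile stores cannot be elided

// Repeatedly take the lock, bump the counter, release, then pause briefly.
void worker(int iterations, int delayIters, double val) {
    for (int n = 0; n < iterations; ++n) {
        while (spinLock.exchange(true, std::memory_order_acquire)) {
            // busy-wait until the previous value was false
        }
        ++counter;
        spinLock.store(false, std::memory_order_release);

        // Short delay after the unlock, standing in for the cost of mfence.
        for (int i = 0; i < delayIters; ++i) {
            sink = i * 3.5 + val;
        }
    }
}

// Run two contending threads and return the final counter value.
long runContended(int iterations, int delayIters) {
    counter = 0;
    std::thread a(worker, iterations, delayIters, 1.0);
    std::thread b(worker, iterations, delayIters, 2.0);
    a.join();
    b.join();
    return counter;
}
```

Calling `runContended(iterations, 0)` and `runContended(iterations, 5)` under a timer should let you compare the no-delay and delay cases on your own machine.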