It seems that there is no store-to-load blocking with 32-byte loads on Sandy Bridge after all. Consider the following modified loop body:
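#ifdef TEST_AVX
/* 32-byte store to tempa, then a 32-byte load from the same address into a different register */
__asm__("vmovapd %%ymm12, (%0)\n\t"
"vmovapd (%0), %%ymm13\n\t"
:
:"r"(tempa)
:"memory", "xmm13"); /* the asm writes memory and ymm13 (clobber named as xmm13) */
#else
/* 16-byte store to tempa, then a 16-byte load from the same address into a different register */
__asm__("movapd %%xmm12, (%0)\n\t"
"movapd (%0), %%xmm13\n\t"
:
:"r"(tempa)
:"memory", "xmm13"); /* the asm writes memory and xmm13 */
#endif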
The only change is the destination register of the load: the store and the load now use two different registers, so there is no dependence between the two instructions or between successive iterations. In this case the SSE version takes 1 cycle per iteration, while the AVX version takes 2 cycles. This is consistent with the fact that SB can perform two 16-byte loads per cycle. Hence, loading 32 bytes takes two cycles, with no stall.
The problem must therefore lie in the counter logic. Clearly, in the AVX case LD_BLOCKS.STORE_FORWARD is incremented even though no blocking actually takes place. This should be taken into account when analyzing performance with the counters.
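For anyone who wants to reproduce the per-iteration figures, here is a minimal sketch of a timing harness around the loop body above. It is not the code I actually ran: the helper name ticks_per_iteration, the buffer declaration, and the use of __rdtsc() are assumptions made for illustration. Note that the TSC counts reference ticks, not core cycles, so the result only approximates cycles per iteration when the core runs at its nominal frequency, and the loop overhead is included in the number.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

/* 32-byte aligned scratch buffer; the name tempa follows the snippet above,
   the declaration itself is an assumption. */
static double tempa[4] __attribute__((aligned(32)));

/* Hypothetical helper: runs the store/load pair `iters` times (iters > 0)
   and returns the average number of TSC ticks per iteration. */
static double ticks_per_iteration(uint64_t iters)
{
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
#ifdef TEST_AVX
        /* 32-byte store followed by a 32-byte load of the same address */
        __asm__ volatile("vmovapd %%ymm12, (%0)\n\t"
                         "vmovapd (%0), %%ymm13\n\t"
                         :
                         : "r"(tempa)
                         : "memory", "xmm13");
#else
        /* 16-byte store followed by a 16-byte load of the same address */
        __asm__ volatile("movapd %%xmm12, (%0)\n\t"
                         "movapd (%0), %%xmm13\n\t"
                         :
                         : "r"(tempa)
                         : "memory", "xmm13");
#endif
    }
    uint64_t stop = __rdtsc();
    return (double)(stop - start) / (double)iters;
}
```

Calling ticks_per_iteration() with a large count (say 100000000) from main, once built with -DTEST_AVX and once without, should give numbers comparable to the ones quoted above, provided frequency scaling does not skew the TSC-to-core-clock ratio.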