That's an interesting observation. I was able to reproduce your results. I managed to improve your SSE code speed quite a bit by unrolling the loop (see the code below). Now for dataLen=2864 SSE
is clearly faster, and for the smaller values it's nearly as fast as AVX. For larger values it's even faster still. This is due to the loop-carried dependency in your SSE code: unrolling the loop increases the instruction-level parallelism (ILP). I did not try unrolling any further. Unrolling the AVX code did not help.
I don't have a clear answer to your question though. My hunch is that it's related to the ILP and the fact that AVX processors such as Sandy Bridge can only load two 128-bit words (the SSE width) simultaneously, not two 256-bit words. So the SSE code can do one SSE addition, one SSE multiplication, two SSE loads, and one SSE store simultaneously. For AVX it can do one AVX load (through two 128-bit loads on ports 2 and 3), one AVX multiplication, one AVX addition, and one 128-bit store (half the AVX width) simultaneously. In other words, although with AVX the multiplications and additions do twice as much work as SSE, the loads and stores are still 128 bits wide. Maybe this sometimes leads to lower ILP with AVX compared to SSE in code dominated by loads and stores?
For more info on the ports and ILP, see this comparison of the Haswell, Sandy Bridge, and Nehalem ports.
__m128 p1, p2, p3, p1_v2, p2_v2, p3_v2;
for (int j = 0; j < N; j++) {
    for (int i = 0; i < dataLen; i += 8) {
        // Unrolled by two: the *_v2 chain is independent of the first chain,
        // so the two multiply-adds can execute in parallel (more ILP).
        p1    = _mm_load_ps(&buf1[i]);
        p1_v2 = _mm_load_ps(&buf1[i+4]);
        p2    = _mm_load_ps(&buf2[i]);
        p2_v2 = _mm_load_ps(&buf2[i+4]);
        p3    = _mm_load_ps(&buf3[i]);
        p3_v2 = _mm_load_ps(&buf3[i+4]);
        p3    = _mm_add_ps(_mm_mul_ps(p1, p2), p3);
        p3_v2 = _mm_add_ps(_mm_mul_ps(p1_v2, p2_v2), p3_v2);
        _mm_store_ps(&buf3[i], p3);
        _mm_store_ps(&buf3[i+4], p3_v2);
    }
}