That's an interesting observation. I was able to reproduce your results. I managed to improve your SSE code speed quite a bit by unrolling the loop (see the code below). Now for dataLen=2864 SSE
is clearly faster, and for the smaller values it's nearly as fast as AVX. For larger values it's even faster still. This is due to the loop-carried dependency in your SSE code: unrolling the loop increases the instruction-level parallelism (ILP). I did not try unrolling any further. Unrolling the AVX code did not help.
I don't have a clear answer to your question though. My hunch is that it's related to the ILP and the fact that AVX processors such as Sandy Bridge can only load two 128-bit words (the SSE width) simultaneously, not two 256-bit words. So the SSE code can do one SSE addition, one SSE multiplication, two SSE loads, and one SSE store simultaneously. For AVX it can do one AVX load (through two 128-bit loads on ports 2 and 3), one AVX multiplication, one AVX addition, and one 128-bit store (half the AVX width) simultaneously. In other words, although with AVX the multiplications and additions do twice as much work as SSE, the loads and stores are still 128 bits wide. Maybe this sometimes leads to lower ILP with AVX compared to SSE in code dominated by loads and stores?
For more info on the ports and ILP, see this comparison of the Haswell, Sandy Bridge, and Nehalem ports.
__m128 p1, p2, p3, p1_v2, p2_v2, p3_v2;
for (int j = 0; j < N; j++) {
    for (int i = 0; i < dataLen; i += 8) {
        // Unrolled by two: the *_v2 chain is independent of the first chain,
        // so the two multiply-adds can execute in parallel (more ILP).
        p1    = _mm_load_ps(&buf1[i]);
        p1_v2 = _mm_load_ps(&buf1[i+4]);
        p2    = _mm_load_ps(&buf2[i]);
        p2_v2 = _mm_load_ps(&buf2[i+4]);
        p3    = _mm_load_ps(&buf3[i]);
        p3_v2 = _mm_load_ps(&buf3[i+4]);
        p3    = _mm_add_ps(_mm_mul_ps(p1, p2), p3);
        p3_v2 = _mm_add_ps(_mm_mul_ps(p1_v2, p2_v2), p3_v2);
        _mm_store_ps(&buf3[i], p3);
        _mm_store_ps(&buf3[i+4], p3_v2);
    }
}