The problem is that your "source" (the array x) is not aligned to the 16-byte boundary that the SSE instructions require.
You can fix this by using the "unaligned" load instruction (_mm_loadu_ps), or by aligning the arrays with __declspec(align(n)), e.g.:
float __declspec(align(16)) x[N];
float __declspec(align(16)) y[N];
Now your x and y arrays are aligned to 16 bytes and can be accessed by SSE instructions [at indices that are multiples of 4, of course]. Note that unaligned access is not allowed for general SSE instructions that take memory arguments, so for example a _mm_max_ps whose second argument (in Intel order, first in AT&T order) comes from memory requires that location to be aligned.
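
To make the two options concrete, here is a minimal sketch of both, not your actual code. The helper names (is_aligned16, max_aligned, max_unaligned) are made up for illustration, and it assumes an x86 compiler that provides xmmintrin.h and an element count that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_loadu_ps, _mm_max_ps, ...

// Hypothetical helper: true if p sits on a 16-byte boundary.
static bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Element-wise max using aligned loads and stores.
// Assumes a, b and out are 16-byte aligned and n is a multiple of 4.
static void max_aligned(const float* a, const float* b, float* out, std::size_t n) {
    assert(is_aligned16(a) && is_aligned16(b) && is_aligned16(out));
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   // faults if a + i is not 16-byte aligned
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_max_ps(va, vb));
    }
}

// Same operation with unaligned loads and stores: works for any float pointer.
// Still assumes n is a multiple of 4.
static void max_unaligned(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // no alignment requirement
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_max_ps(va, vb));
    }
}
```

max_aligned is only safe on arrays declared with an alignment specifier, either __declspec(align(16)) as above or the standard C++11 spelling alignas(16) float x[N];. max_unaligned also accepts a pointer such as x + 1, at the cost of using the unaligned load/store instructions.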