The aligned SSE load and store intrinsics (_mm_load_ps, _mm_store_ps) require 16-byte aligned addresses. If you're not reading outside the array, this is likely your problem.
Try using _mm_storeu_ps and _mm_loadu_ps, which are the unaligned versions. They may run a little slower, but they will work with any address. Once you've verified that's the problem, try aligning the memory in the first place for maximum performance.
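To illustrate both steps, here is a minimal sketch. It assumes GCC or Clang on x86 with SSE enabled. The function names (add_unaligned, add_aligned, is_aligned16) are made up for this example and are not from your code. The first function uses the unaligned intrinsics and accepts any pointers; the second uses the aligned ones and is only safe when every pointer is 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; also declares _mm_malloc/_mm_free on GCC/Clang */

/* Hypothetical helper: returns 1 if p is 16-byte aligned, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[i] = a[i] + b[i] for i in [0, n). Safe for any pointer alignment. */
void add_unaligned(const float *a, const float *b, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));  /* unaligned store */
    }
    for (; i < n; ++i)                     /* scalar tail for leftover elements */
        dst[i] = a[i] + b[i];
}

/* Same computation, but a, b and dst must all be 16-byte aligned;
 * otherwise the aligned load/store will fault. */
void add_aligned(const float *a, const float *b, float *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);    /* aligned load */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));   /* aligned store */
    }
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

To get aligned memory in the first place, one option is to allocate with _mm_malloc(n * sizeof(float), 16) and release with _mm_free. You can check a pointer with is_aligned16 before calling add_aligned, and fall back to add_unaligned when the check fails.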