On recent Intel CPUs (Nehalem/Core i7 and later) a misaligned load is a reasonable approach, since the penalty is small, but on older CPUs it is relatively expensive. An alternative is to use _mm_alignr_epi8 (PALIGNR): you iterate along a row maintaining three consecutive vectors; after each iteration you shuffle the vectors along by one and load one new vector, so there is only one load per iteration.
__m128 va = _mm_setzero_ps();
__m128 vb = _mm_load_ps(&from[row][0]);
for (col = 0; col < N; col += 4)
{
    // note: the final iteration loads one vector past the end of the row -
    // pad the row or peel the last iteration in real code
    __m128 vc = _mm_load_ps(&from[row][col + 4]);
    __m128 centre = vb;
    // left  = { x[col-1], x[col],   x[col+1], x[col+2] }
    __m128 left = _mm_castsi128_ps(_mm_alignr_epi8(
        _mm_castps_si128(vb), _mm_castps_si128(va), 3 * sizeof(float)));
    // right = { x[col+1], x[col+2], x[col+3], x[col+4] }
    __m128 right = _mm_castsi128_ps(_mm_alignr_epi8(
        _mm_castps_si128(vc), _mm_castps_si128(vb), sizeof(float)));
    // do stuff ...
    va = vb; // shuffle vectors along
    vb = vc;
}
AVX is a bit trickier because most shuffles, including the AVX2 _mm256_alignr_epi8, operate within each 128-bit lane rather than across the full 256-bit register - you may be better off just sticking with unaligned loads.