If you can read in sets of 4 for a and b, this can be done most cleanly/elegantly/efficiently by operating on 4-tuples because it will more fully saturate some of the instructions, so that all parts of the computation are part of the solution. The below solution uses up to SSSE3. Of course you will be better off to pull this out of the function, initialize constants up front, and find the most efficient way to put the values into the __m128i
values depending on how the surrounding code is structured.
// a, b, and out, must all point to 4 integers
void norm2x4(const unsigned *a, const unsigned *b, unsigned *out) {
// load up registers a and b, in practice this should probably not be in a function,
// initialization of zero can happen outside of a loop,
// and a and b can be loaded directly from memory into __m128i registers
__m128i const zero = _mm_setzero_si128();
__m128i alo = _mm_loadu_si128((__m128i*)a); // this can also be adapted to aligned read instructions if you ensure an aligned buffer
__m128i blo = _mm_loadu_si128((__m128i*)b);
// everything is already in the register where we need it except it
// needs to be expanded to 2-byte ints for computations to work correctly
__m128i ahi = _mm_unpackhi_epi8(alo, zero);
__m128i bhi = _mm_unpackhi_epi8(blo, zero);
alo = _mm_unpacklo_epi8(alo, zero);
blo = _mm_unpacklo_epi8(blo, zero);
alo = _mm_sub_epi16(alo, blo); // don't care if a - b, or b - a, the "wrong" one will result in a
ahi = _mm_sub_epi16(ahi, bhi); // negation the square will later correct
alo = _mm_madd_epi16(alo, alo); // perform the square, and add every two adjacent
ahi = _mm_madd_epi16(ahi, ahi);
alo = _mm_hadd_epi32(alo, ahi); // add horizontal elements; ahi now contains 4 ints which are your results
// store the result to output; this can be adapted to an aligned store if you ensure an aligned buffer
// or the individual values can be extracted directly to 32-bit registers using _mm_extract_epi32
_mm_storeu_si128((__m128i*)out, alo);
}