Compute norm between two integers interpreted as 4 bytes

Question 1

If you can read in sets of 4 for a and b, this can be done most cleanly/elegantly/efficiently by operating on 4-tuples because it will more fully saturate some of the instructions, so that all parts of the computation are part of the solution. The below solution uses up to SSSE3. Of course you will be better off to pull this out of the function, initialize constants up front, and find the most efficient way to put the values into the __m128i values depending on how the surrounding code is structured.

// a, b, and out, must all point to 4 integers
void norm2x4(const unsigned *a, const unsigned *b, unsigned *out) {
  // load up registers a and b, in practice this should probably not be in a function,
  // initialization of zero can happen outside of a loop,
  // and a and b can be loaded directly from memory into __m128i registers
  __m128i const zero = _mm_setzero_si128();
  __m128i       alo  = _mm_loadu_si128((__m128i*)a); // this can also be adapted to aligned read instructions if you ensure an aligned buffer
  __m128i       blo  = _mm_loadu_si128((__m128i*)b);

  // everything is already in the register where we need it except it
  // needs to be expanded to 2-byte ints for computations to work correctly
  __m128i       ahi = _mm_unpackhi_epi8(alo, zero);
  __m128i       bhi = _mm_unpackhi_epi8(blo, zero);
  alo               = _mm_unpacklo_epi8(alo, zero);
  blo               = _mm_unpacklo_epi8(blo, zero);
  alo               = _mm_sub_epi16(alo, blo);  // don't care if a - b, or b - a, the "wrong" one will result in a
  ahi               = _mm_sub_epi16(ahi, bhi);  // negation the square will later correct
  alo               = _mm_madd_epi16(alo, alo); // perform the square, and add every two adjacent
  ahi               = _mm_madd_epi16(ahi, ahi);
  alo               = _mm_hadd_epi32(alo, ahi); // add horizontal elements; ahi now contains 4 ints which are your results

  // store the result to output; this can be adapted to an aligned store if you ensure an aligned buffer
  // or the individual values can be extracted directly to 32-bit registers using _mm_extract_epi32
  _mm_storeu_si128((__m128i*)out, alo);
}

Question 2

Very simple to do the whole thing in a few instructions using SSE:

#include <immintrin.h>
#include <stdint.h>

uint32_t norm2(uint32_t a, uint32_t b) {
    const __m128i vec_zero = _mm_setzero_si128();
    __m128i vec_a = _mm_unpacklo_epi8(_mm_cvtsi32_si128(a), vec_zero);
    __m128i vec_b = _mm_unpacklo_epi8(_mm_cvtsi32_si128(b), vec_zero);
    __m128i vec_diff = _mm_sub_epi16(vec_a, vec_b);
    __m128i vec_dsq = _mm_madd_epi16(vec_diff, vec_diff);
    return _mm_cvtsi128_si32(_mm_hadd_epi32(vec_dsq, vec_dsq));
}

What we’re doing here is “unpacking” both a and b with a zero vector to expand the individual bytes into vectors of 16-bit integers. We then subtract them (as 16-bit integers, avoiding risk of overflow), and multiply and accumulate them (as 32-bit integers, again avoiding risk of overflow).

I don’t have GCC installed to test with, but the above generates near-optimal assembly with clang; it shouldn’t be necessary to drop into assembly for such a simple task.

Question 3

a branchless version (as square(-x) == square(x)):

uint32_t sqd(int32_t a, int32_t b) {
  int32_t x = a - b;
  return x * x;
}

uint32_t norm2(uint32_t a, uint32_t b) {
  return sqd( a & 0x000000FF     , b &  0x000000FF      )
       + sqd((a & 0x0000FF00) >>  8, (b & 0x0000FF00) >>  8)
       + sqd((a & 0x00FF0000) >> 16, (b & 0x00FF0000) >> 16)
       + sqd((a & 0xFF000000) >> 24, (b & 0xFF000000) >> 24);
}