Suggestion in ARM NEON optimization

Question 1

After the comparison, you have an array of 8 booleans represented by 0xff or 0x00. The reason SIMD comparisons (on any architecture) produce those values is to make them useful for a bit-mask operation (and/or bit-select in NEON's case) so you can turn the result into an arbitrary value quickly, without a multiply.

So rather than reducing them to 1 or 0 and shifting them about, you'll find it easier to mask them with the constant 0x8040201008040201. Then each lane contains the bit corresponding to its position in the final result. You can pre-load the constant into another register (I'll use d3).

VAND d0, d2, d3

Then, to combine the results, you can use VPADD (instead of OR), which will combine adjacent pairs of lanes, d0[0] = d0[0] + d0[1], d0[1] = d0[2] + d0[3], etc... Since the bit patterns do not overlap there is no carry and add works just as well as or. Also, because the output is half as large as the input we have to fill in the second half with junk. I've used a second copy of d0 for that.

You'll need to do the add three times to get all columns combined.

VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0

and now the result will now be in d0[0].

As you can see, d0 has room for seven more results; and some lanes of the VPADD operations have been working with junk data. It would be better if you could fetch more data at once, and feed that additional work in as you go so that none of the arithmetic is wasted.

EDIT

Supposing the loop is unrolled four times; with results in d4, d5, d6, and d7; the constant mentioned earlier should be loaded into, say, d30 and d31, and then some q register arithmetic can be used:

VAND q0, q2, q15
VAND q1, q3, q15

VPADD.u8 d0, d0, d1
VPADD.u8 d2, d2, d3
VPADD.u8 d0, d0, d2
VPADD.u8 d0, d0, d0

With the final result in d0[0..3], or simply the 32-bit value in d0[0].

There seem to be lots of registers free to unroll it further, but I don't know how many of those you'll use up on other calculations.

Question 2

load a d register with the value 0x8040201008040201
vand with the result of vclt
vpaddl.u8 from 2)
vpaddl.u16 from 3)
vpaddl.u32 from 4)
store the lowest single byte from 5)

Question 3

Start with expressing the parallelism explicitly to begin with:

int /* bool, whatever ... */ val[8] = {
    center[0] < center[1],
    center[2] < center[3],
    center[4] < center[5],
    center[6] < center[7],
    center[8] < center[9],
    center[10] < center[11],
    center[12] < center[13],
    center[14] < center[15]
};
d[i] = extract_mask(val);

The shifts are equivalent to a "mask move", as you want each comparison to result in a single bit.

The comparison of the above sixteen values can be done by first doing a structure load (vld2.8) to split adjacent bytes into two uint8x8_t, then the parallel compare. The result of that is a uint8x8_t with either 0xff or 0x00 in the bytes. You want one bit of each, in the respective bit position.

That's a "mask extract"; on Intel SSE2, that'd be MASKMOV but on Neon, no direct equiv exists; three vpadd as shown above (or see SSE _mm_movemask_epi8 equivalent method for ARM NEON for more on this) are a suitable substitute.