After the comparison, you have an array of 8 booleans represented by 0xff or 0x00. The reason SIMD comparisons (on any architecture) produce those values is to make them useful for a bit-mask operation (and/or bit-select in NEON's case) so you can turn the result into an arbitrary value quickly, without a multiply.
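To make the "without a multiply" point concrete, here is a rough scalar C model of what happens in a single lane. This is not NEON code, and the names and_mask and bit_select are made up for illustration; the second one only mimics what a bit-select does per lane.

```c
#include <assert.h>
#include <stdint.h>

/* mask is one lane of a compare result: 0xff for true, 0x00 for false. */

/* AND form: yields value when the lane is true, 0 when it is false. */
uint8_t and_mask(uint8_t mask, uint8_t value) {
    assert(mask == 0x00 || mask == 0xff);
    return (uint8_t)(mask & value);
}

/* Bit-select form: yields if_true when the lane is true, else if_false. */
uint8_t bit_select(uint8_t mask, uint8_t if_true, uint8_t if_false) {
    assert(mask == 0x00 || mask == 0xff);
    return (uint8_t)((if_true & mask) | (if_false & ~mask));
}
```

With a 1-or-0 boolean you would need a multiply (or a negate) to get the same effect; with all-ones or all-zeros a single logical operation is enough.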
So rather than reducing them to 1 or 0 and shifting them about, you'll find it easier to mask them with the constant 0x8040201008040201. Then each lane contains the bit corresponding to its position in the final result. You can pre-load the constant into another register (I'll use d3).
VAND d0, d2, d3
Then, to combine the results, you can use VPADD (instead of OR), which will combine adjacent pairs of lanes, d0[0] = d0[0] + d0[1], d0[1] = d0[2] + d0[3], etc. Since the bit patterns do not overlap there is no carry, and add works just as well as or. Also, because the output is half as large as the input, we have to fill in the second half with junk. I've used a second copy of d0 for that.
You'll need to do the add three times to get all columns combined.
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
and the result will now be in d0[0].
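If you want to convince yourself of the lane arithmetic before writing the assembly, the steps above can be modelled in plain scalar C. This is only a sketch of the idea, not NEON code: d_reg, model_vand, model_vpadd_u8 and model_movemask8 are hypothetical names, and the lane order assumes the low byte of the constant lands in lane 0. The asserts check the "no carry" claim, i.e. that the two lanes being added never share a set bit.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit D register viewed as eight u8 lanes. */
typedef struct { uint8_t lane[8]; } d_reg;

/* Scalar stand-in for VAND: lane-wise bitwise AND. */
d_reg model_vand(d_reg a, d_reg b) {
    d_reg r;
    for (int i = 0; i < 8; i++)
        r.lane[i] = (uint8_t)(a.lane[i] & b.lane[i]);
    return r;
}

/* Scalar stand-in for VPADD.u8 Dd, Dn, Dm: the low four result lanes are
   the adjacent-pair sums of n, the high four are the pair sums of m. */
d_reg model_vpadd_u8(d_reg n, d_reg m) {
    d_reg r;
    for (int i = 0; i < 4; i++) {
        /* Non-overlapping bits, so the add cannot carry and equals OR. */
        assert((n.lane[2 * i] & n.lane[2 * i + 1]) == 0);
        assert((m.lane[2 * i] & m.lane[2 * i + 1]) == 0);
        r.lane[i]     = (uint8_t)(n.lane[2 * i] + n.lane[2 * i + 1]);
        r.lane[i + 4] = (uint8_t)(m.lane[2 * i] + m.lane[2 * i + 1]);
    }
    return r;
}

/* cmp holds 0xff or 0x00 in each lane, as the comparison would leave it. */
uint8_t model_movemask8(d_reg cmp) {
    /* 0x8040201008040201 split into lanes, lane 0 first. */
    const d_reg bits = {{0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80}};
    d_reg d0 = model_vand(cmp, bits);   /* VAND d0, d2, d3      */
    d0 = model_vpadd_u8(d0, d0);        /* VPADD.u8 d0, d0, d0  */
    d0 = model_vpadd_u8(d0, d0);        /* VPADD.u8 d0, d0, d0  */
    d0 = model_vpadd_u8(d0, d0);        /* VPADD.u8 d0, d0, d0  */
    return d0.lane[0];
}
```

For example, a compare result that is true in lanes 0, 2, 3 and 7 gives 0x01 + 0x04 + 0x08 + 0x80 = 0x8d.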
As you can see, d0 has room for seven more results, and some lanes of the VPADD operations have been working with junk data. It would be better if you could fetch more data at once, and feed that additional work in as you go so that none of the arithmetic is wasted.
EDIT
Supposing the loop is unrolled four times, with results in d4, d5, d6, and d7, the constant mentioned earlier should be loaded into, say, d30 and d31, and then some q register arithmetic can be used:
VAND q0, q2, q15
VAND q1, q3, q15
VPADD.u8 d0, d0, d1
VPADD.u8 d2, d2, d3
VPADD.u8 d0, d0, d2
VPADD.u8 d0, d0, d0
The final result is then in the byte lanes d0[0..3], or simply the 32-bit value in d0[0].
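The unrolled sequence can be checked the same way with a scalar C model. Again this is just a sketch under assumptions, not NEON code: d_reg, and_lanes, padd_lanes and model_movemask4x8 are hypothetical names, cmp[0..3] stand in for d4..d7, and the 32-bit value is assembled with lane 0 as the low byte.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit D register viewed as eight u8 lanes. */
typedef struct { uint8_t lane[8]; } d_reg;

/* Lane-wise AND, standing in for one half of a q-register VAND. */
d_reg and_lanes(d_reg a, d_reg b) {
    d_reg r;
    for (int i = 0; i < 8; i++)
        r.lane[i] = (uint8_t)(a.lane[i] & b.lane[i]);
    return r;
}

/* Stand-in for VPADD.u8 Dd, Dn, Dm: pair sums of n fill the low four
   lanes of the result, pair sums of m fill the high four. */
d_reg padd_lanes(d_reg n, d_reg m) {
    d_reg r;
    for (int i = 0; i < 4; i++) {
        /* Every pair has disjoint bits, so no add ever carries. */
        assert((n.lane[2 * i] & n.lane[2 * i + 1]) == 0);
        assert((m.lane[2 * i] & m.lane[2 * i + 1]) == 0);
        r.lane[i]     = (uint8_t)(n.lane[2 * i] + n.lane[2 * i + 1]);
        r.lane[i + 4] = (uint8_t)(m.lane[2 * i] + m.lane[2 * i + 1]);
    }
    return r;
}

/* cmp[k] holds 0xff or 0x00 per lane; byte k of the result is its mask. */
uint32_t model_movemask4x8(const d_reg cmp[4]) {
    /* The same constant sits in both d30 and d31 (together q15). */
    const d_reg bits = {{0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80}};
    d_reg d0 = and_lanes(cmp[0], bits);  /* VAND q0, q2, q15 (low half)  */
    d_reg d1 = and_lanes(cmp[1], bits);  /* VAND q0, q2, q15 (high half) */
    d_reg d2 = and_lanes(cmp[2], bits);  /* VAND q1, q3, q15 (low half)  */
    d_reg d3 = and_lanes(cmp[3], bits);  /* VAND q1, q3, q15 (high half) */
    d0 = padd_lanes(d0, d1);             /* VPADD.u8 d0, d0, d1 */
    d2 = padd_lanes(d2, d3);             /* VPADD.u8 d2, d2, d3 */
    d0 = padd_lanes(d0, d2);             /* VPADD.u8 d0, d0, d2 */
    d0 = padd_lanes(d0, d0);             /* VPADD.u8 d0, d0, d0 */
    return (uint32_t)d0.lane[0]
         | ((uint32_t)d0.lane[1] << 8)
         | ((uint32_t)d0.lane[2] << 16)
         | ((uint32_t)d0.lane[3] << 24);
}
```

Note that in this version only the last VPADD works on duplicated data; the first three each consume two distinct inputs.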
There seem to be lots of registers free to unroll it further, but I don't know how many of those you'll use up on other calculations.