The Neon "halving add" operation vhadd
works like this:
A = (B + C) >> 1
whereas the SSE average intrinsic _mm_avg_epu8
does this:
A = (B + C + 1) >> 1
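In other words, Neon's "halving add" truncates the average (it always rounds down), whereas SSE rounds to nearest, with exact halves rounded up. In both cases the sum is formed at full precision before the shift, so it cannot overflow the 8-bit element.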
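To make the difference concrete, here is a minimal scalar sketch of the two formulas for a single unsigned byte. The function names hadd_u8 and rhadd_u8 are made up for illustration; they model one lane of the vector operations and are not real intrinsics.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of Neon's halving add:
   the sum is widened first, then shifted, so the result truncates. */
static uint8_t hadd_u8(uint8_t b, uint8_t c)
{
    return (uint8_t)(((unsigned)b + (unsigned)c) >> 1);
}

/* Hypothetical scalar model of one lane of _mm_avg_epu8:
   the extra +1 before the shift rounds exact halves up. */
static uint8_t rhadd_u8(uint8_t b, uint8_t c)
{
    return (uint8_t)(((unsigned)b + (unsigned)c + 1u) >> 1);
}
```

The two only disagree when B + C is odd: for B = 1, C = 2 the truncating form gives 1 and the rounding form gives 2, while for B = 2, C = 2 both give 2.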
Fortunately there is a Neon instruction which rounds in the same way as SSE's _mm_avg_epu8
- it's called vrhadd - Vector Rounding Halving Add. For 16 unsigned bytes the matching intrinsic is vrhaddq_u8.
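As a rough sketch of how the port might look, the wrapper below picks the rounding average for whichever instruction set is available. The name avg_round_u8x16 and the pointer-based signature are assumptions made for this example; only vld1q_u8, vrhaddq_u8, vst1q_u8 and the SSE2 load/avg/store intrinsics are the real APIs.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: rounding average of 16 unsigned bytes,
   out[i] = (a[i] + b[i] + 1) >> 1, with no alignment requirement. */
void avg_round_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__ARM_NEON)
    /* Neon: Vector Rounding Halving Add on 16 x u8 lanes. */
    vst1q_u8(out, vrhaddq_u8(vld1q_u8(a), vld1q_u8(b)));
#elif defined(__SSE2__)
    /* SSE2: unaligned loads, then the rounding byte average. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
#else
    /* Portable fallback with the same rounding behaviour. */
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(((unsigned)a[i] + (unsigned)b[i] + 1u) >> 1);
#endif
}
```

Swapping vrhaddq_u8 for vhaddq_u8 in the Neon branch would give the truncating result instead, which is exactly the mismatch described above.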