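VTBL.8 is the most flexible NEON instruction for rearranging bytes: it is a table lookup, so any byte of the source registers can be routed to any byte of the destination.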
Loading 3x8 bytes of packed RGB into registers d0, d1, d2 gives the layout below (the second row is the byte index in hex, as seen by VTBL):
R G B R G B R G | B R G B R G B R | G B R G B R G B |
0 1 2 3 4 5 6 7 | 8 9 a b c d e f | 10 ....      17 |
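VTBL.8 d3, { d0,d1,d2 }, d6 ; select bytes into d3 from d0,d1,d2 using the indices in d6
VTBL.8 d4, { d0,d1,d2 }, d7 ; same table, indices for output bytes 8..15
VTBL.8 d5, { d0,d1,d2 }, d8 ; same table, indices for output bytes 16..23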
Here d6, d7, d8 hold, for each output byte, the index of the source byte to read. For example, '0 1 2 3 4 5 6 7' in d6 reproduces the original order of the first eight bytes, while '0 2 1 3 5 4 6 8' in d6, followed by '7 ...' in d7, swaps G and B. The constant vectors d6..d8 need to be loaded only once, at the beginning of the routine.
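If you want to sanity-check the index vectors before writing the assembly, a rough scalar model in plain C is enough. This is only a sketch: the function names (vtbl3_model, make_gb_swap_indices, swap_gb_8pixels) are made up for illustration, and the model assumes the usual VTBL behaviour of returning 0 for an out-of-range index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "VTBL.8 dd, {d0,d1,d2}, dm" (hypothetical helper):
   each of the 8 index bytes selects one byte of the 24-byte table;
   an index outside the table yields 0. */
static void vtbl3_model(uint8_t out[8], const uint8_t table[24],
                        const uint8_t idx[8])
{
    for (size_t i = 0; i < 8; ++i)
        out[i] = (idx[i] < 24) ? table[idx[i]] : 0;
}

/* Fill the 24 index bytes (d6, d7, d8 back to back) for a G<->B swap. */
static void make_gb_swap_indices(uint8_t idx[24])
{
    for (size_t i = 0; i < 24; ++i) {
        size_t channel = i % 3;                 /* 0 = R, 1 = G, 2 = B */
        if (channel == 1)
            idx[i] = (uint8_t)(i + 1);          /* G slot reads the B byte */
        else if (channel == 2)
            idx[i] = (uint8_t)(i - 1);          /* B slot reads the G byte */
        else
            idx[i] = (uint8_t)i;                /* R stays where it is */
    }
}

/* Swap G and B in 8 packed RGB pixels (24 bytes), three lookups in total. */
static void swap_gb_8pixels(uint8_t dst[24], const uint8_t src[24])
{
    uint8_t idx[24];

    assert(dst != src);          /* in-place would clobber bytes still to be read */
    make_gb_swap_indices(idx);   /* real code would build this once, outside the loop */
    vtbl3_model(dst + 0,  src, idx + 0);    /* VTBL.8 d3, {d0,d1,d2}, d6 */
    vtbl3_model(dst + 8,  src, idx + 8);    /* VTBL.8 d4, {d0,d1,d2}, d7 */
    vtbl3_model(dst + 16, src, idx + 16);   /* VTBL.8 d5, {d0,d1,d2}, d8 */
}
```

The first eight generated indices come out as 0 2 1 3 5 4 6 8 and the ninth as 7, which matches the example above.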
Another possibility is a de-interleaving load followed by bitwise selects between the channels:
VLD3.8 { d0,d1,d2 }, [r0] ; read R, G, B into separate registers
VLD3.8 { d3,d4,d5 }, [r0] ; make a second copy (or use some other instruction)
VBIT d3, d1, d6 ; d3 is now either R or G, per bit of mask d6
VBIT d4, d2, d7 ; d4 is now either G or B, per bit of mask d7
VBIT d5, d0, d8 ; d5 is now either B or R, per bit of mask d8
VBIT d0, d4, d9 ; d0 is now R or (G or B)
VBIT d1, d5, d10 ; d1 is now G or (B or R)
VBIT d2, d3, d11 ; d2 is now B or (R or G)
Even though the example uses six registers (d6..d11) for the select masks, three independent registers should be enough; VBIF can be used wherever the inverted mask is needed.
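To try out a set of masks for the VBIT sequence, the same kind of scalar model works. Again this is just a sketch under assumptions: a D register is modelled as a uint64_t, and the names dreg, rgb_planes, select_channels and GB_SWAP_MASKS are hypothetical. The masks m[0]..m[5] stand for d6..d11 in the listing above.

```c
#include <stdint.h>

typedef uint64_t dreg;                      /* one 64-bit D register */
typedef struct { dreg r, g, b; } rgb_planes; /* result of a VLD3.8 load */

/* Model of "VBIT dd, dn, dm": take the bit from dn where dm is 1,
   keep the old bit of dd where dm is 0. */
static dreg vbit(dreg d, dreg n, dreg m)
{
    return (d & ~m) | (n & m);
}

/* The six-VBIT sequence from the listing, with m[0..5] playing d6..d11. */
static rgb_planes select_channels(rgb_planes in, const dreg m[6])
{
    dreg d3 = vbit(in.r, in.g, m[0]);   /* R or G */
    dreg d4 = vbit(in.g, in.b, m[1]);   /* G or B */
    dreg d5 = vbit(in.b, in.r, m[2]);   /* B or R */
    rgb_planes out;
    out.r = vbit(in.r, d4, m[3]);       /* R or (G or B) */
    out.g = vbit(in.g, d5, m[4]);       /* G or (B or R) */
    out.b = vbit(in.b, d3, m[5]);       /* B or (R or G) */
    return out;
}

/* One mask set that swaps G and B in every lane:
   d3 = G, d5 = B, then G slot takes d5 and B slot takes d3. */
static const dreg GB_SWAP_MASKS[6] = {
    ~(dreg)0, 0, 0, 0, ~(dreg)0, ~(dreg)0
};
```

Because the masks are applied per bit, setting only some byte lanes to 0xFF swaps just those pixels and leaves the rest untouched.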