Question

I am writing a NEON subroutine for image processing that does color swapping, i.e., I sequentially load the R, G, B channels from an array and, depending on some configuration, permute some of them.

There are at most 6 permutations:

  (RGB) -> { (RGB),(RBG),(GRB),(GBR),(BRG),(BGR) }

The most efficient way would be to have a separate subroutine for each case, each using the corresponding VSWP instructions. As the subroutine will do several other things, I would prefer to keep everything in just one routine, even if it is less efficient.

I have also read that conditional execution and branching are not advisable. So, if I want to do it in a single block of branchless code, the only thing that comes to mind is

New_R = a(0)*R+a(1)*G+a(2)*B
New_G = a(3)*R+a(4)*G+a(5)*B
New_B = a(6)*R+a(7)*G+a(8)*B

where exactly one a(i) in each row and each column is 1 at any time, and the rest are 0.
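
The formulation above can be sketched in scalar C as follows. This is only an illustrative model, not NEON code: the array `perm` and the helpers `build_coefficients` and `apply_coefficients` are hypothetical names introduced here, with `perm[r]` giving the source channel (0 = R, 1 = G, 2 = B) for output channel r.

```c
#include <assert.h>
#include <stdint.h>

/* perm[r] = source channel (0 = R, 1 = G, 2 = B) feeding output channel r.
 * Hypothetical helper: fills a[0..8] with a single 1 per row (and a single 1
 * per column as well, provided perm is a true permutation). */
void build_coefficients(const uint8_t perm[3], uint8_t a[9])
{
    for (int r = 0; r < 3; ++r) {
        assert(perm[r] < 3);
        for (int c = 0; c < 3; ++c)
            a[3 * r + c] = (uint8_t)(perm[r] == c);
    }
}

/* New_R, New_G, New_B as weighted sums; only one term per row is nonzero,
 * so the sum never exceeds 255. */
void apply_coefficients(const uint8_t a[9], const uint8_t in[3], uint8_t out[3])
{
    for (int r = 0; r < 3; ++r)
        out[r] = (uint8_t)(a[3 * r] * in[0]
                         + a[3 * r + 1] * in[1]
                         + a[3 * r + 2] * in[2]);
}
```

For example, `perm = {2, 0, 1}` yields the rows (0 0 1), (1 0 0), (0 1 0), which corresponds to (RGB) -> (BRG).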

Question: Is there a smarter way to do this, bearing in mind that it has to be coded in NEON?

Solution

VTBL.8 (table lookup) is the most general tool in NEON for rearranging bytes.

Loading 3x8 bytes of packed RGB data into registers d0, d1, d2 gives the following layout (byte indices in hexadecimal):

  R G B R G B R G | B R G B R G B R | G B R G B R G B |
  0 1 2 3 4 5 6 7   8 9 a b c d e f ....            17

VTBL.8 d3, { d0,d1,d2 }, d6  ;; select bytes to d3 from d0,d1,d2 based on d6
VTBL.8 d4, { d0,d1,d2 }, d7
VTBL.8 d5, { d0,d1,d2 }, d8

where d6, d7, d8 hold, for each output byte, the index of the source byte to read from the 24-byte table formed by d0, d1, d2. For example, d6 = '0 1 2 3 4 5 6 7' keeps the original order in the first 8 bytes, while d6 = '0 2 1 3 5 4 6 8' and d7 = '7 ...' swap G and B. The constant vectors d6..d8 need to be loaded just once, at the beginning of the routine.
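
The index vectors can be generated for any of the six permutations rather than written by hand. The following is a minimal scalar C sketch of that idea, not NEON code: `build_vtbl_indices`, `vtbl3_model`, `permute_rgb8` and the `perm` array are hypothetical names, and `vtbl3_model` only imitates what a three-register VTBL.8 does (an out-of-range index produces 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* perm[c] = source channel (0 = R, 1 = G, 2 = B) feeding output channel c.
 * Fills idx[0..7] for d6, idx[8..15] for d7 and idx[16..23] for d8. */
void build_vtbl_indices(const uint8_t perm[3], uint8_t idx[24])
{
    for (size_t i = 0; i < 24; ++i) {
        assert(perm[i % 3] < 3);
        /* start of the pixel containing byte i, plus the chosen source channel */
        idx[i] = (uint8_t)((i / 3) * 3 + perm[i % 3]);
    }
}

/* Scalar model of VTBL.8 Dd, {d0,d1,d2}, Dm: indices >= 24 give 0. */
void vtbl3_model(const uint8_t table[24], const uint8_t idx[8], uint8_t out[8])
{
    for (size_t i = 0; i < 8; ++i)
        out[i] = idx[i] < 24 ? table[idx[i]] : 0;
}

/* Permute 8 packed RGB pixels (24 bytes) with three table lookups,
 * mirroring the three VTBL instructions above. */
void permute_rgb8(const uint8_t src[24], const uint8_t idx[24], uint8_t dst[24])
{
    for (size_t r = 0; r < 3; ++r)
        vtbl3_model(src, idx + 8 * r, dst + 8 * r);
}
```

With `perm = {0, 2, 1}` (swap G and B) the first eight indices come out as 0 2 1 3 5 4 6 8 and the ninth as 7, matching the vectors quoted above.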

Another possibility is the following sequence, which uses a de-interleaving load (VLD3) followed by bitwise selects:

VLD3.8 { d0,d1,d2 }, [r0]    ; // Read R, G, B to separate registers
VLD3.8 { d3,d4,d5 }, [r0]    ; // Make a second copy (or use some other instruction)

VBIT d3, d1, d6              ; // d3 is now either R or G
VBIT d4, d2, d7              ; // d4 is now either G or B
VBIT d5, d0, d8              ; // d5 is now either B or R

VBIT d0, d4, d9              ; // d0 is now R or (G or B)
VBIT d1, d5, d10             ; // d1 is now G or (B or R)
VBIT d2, d3, d11             ; // d2 is now B or (R or G)

Even though the example uses 6 registers (d6..d11) for the selection masks, each of them all-ones or all-zeros, 3 independent registers should be enough; VBIF can be used where the reversed logic is needed.
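
One way to derive the masks for the sequence above is sketched below in scalar C, operating on a single byte per channel. This is an assumption-laden model rather than NEON code: `vbit`, `mask`, `select_rgb` and `perm` are hypothetical names, `perm[c]` is the source channel (0 = R, 1 = G, 2 = B) for output channel c, and the sketch keeps all six masks separate for clarity instead of attempting the three-register reduction.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VBIT Dd, Dn, Dm: take the bits of n where m is 1,
 * keep the bits of d where m is 0. */
static uint8_t vbit(uint8_t d, uint8_t n, uint8_t m)
{
    return (uint8_t)((d & (uint8_t)~m) | (n & m));
}

/* All-ones or all-zeros selection mask. */
static uint8_t mask(int cond)
{
    return cond ? 0xFF : 0x00;
}

/* Two-stage select network following the VBIT sequence above. */
void select_rgb(const uint8_t perm[3], const uint8_t in[3], uint8_t out[3])
{
    assert(perm[0] < 3 && perm[1] < 3 && perm[2] < 3);

    uint8_t d0 = in[0], d1 = in[1], d2 = in[2];  /* R, G, B            */
    uint8_t d3 = d0,    d4 = d1,    d5 = d2;     /* second copy        */

    /* Stage 1: each temporary picks between two neighbouring channels. */
    d3 = vbit(d3, d1, mask(perm[2] == 1));  /* d6: R or G, used by output B */
    d4 = vbit(d4, d2, mask(perm[0] == 2));  /* d7: G or B, used by output R */
    d5 = vbit(d5, d0, mask(perm[1] == 0));  /* d8: B or R, used by output G */

    /* Stage 2: keep the original channel or take the stage-1 result. */
    d0 = vbit(d0, d4, mask(perm[0] != 0));  /* d9  */
    d1 = vbit(d1, d5, mask(perm[1] != 1));  /* d10 */
    d2 = vbit(d2, d3, mask(perm[2] != 2));  /* d11 */

    out[0] = d0;
    out[1] = d1;
    out[2] = d2;
}
```

In a real routine the six mask values would be computed once from the configuration and broadcast to d6..d11 before the pixel loop, so the loop body itself stays branchless.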

Licensed under: CC-BY-SA with attribution