Question

I am writing a NEON subroutine for image processing that swaps color channels: I sequentially load the R, G, B channels from an array and, depending on some configuration, permute some of them.

There are at most 6 permutations:

  (RGB) -> { (RGB),(RBG),(GRB),(GBR),(BRG),(BGR) }

The most efficient way would be to have a separate subroutine for each case, each with the corresponding VSWP instructions. As the subroutine will do several other things, I would prefer to keep everything in just one routine, even if it is less efficient.

I have also read that conditional execution and branching are not advisable. So, if I want to keep it in one block of branchless code, the only thing that comes to mind is

New_R = a(0)*R+a(1)*G+a(2)*B
New_G = a(3)*R+a(4)*G+a(5)*B
New_B = a(6)*R+a(7)*G+a(8)*B

where exactly one a(i) in each row and column is 1 at any time, and the rest are 0.

Question: Is there any smarter way to do it, bearing in mind that it has to be coded in NEON?


Solution

VTBL.8 is the most powerful tool in NEON to swap bytes.

Loading 3x8 bytes to registers d0,d1,d2 would look like

  R G B R G B R G | B R G B R G B R | G B R G B R G B |
  0 1 2 3 4 5 6 7   8 9 a b c d e f ....            17

VTBL d3, { d0,d1,d2 }, d6  ;; select bytes to d3 from d0,d1,d2 based on d6
VTBL d4, { d0,d1,d2 }, d7
VTBL d5, { d0,d1,d2 }, d8

where d6, d7, d8 hold, for each output byte, the index of the source byte to read. For example, d6 is '0 1 2 3 4 5 6 7' for the original order, while '0 2 1 3 5 4 6 8', '7 ...' swaps G and B. The constant vectors d6..d8 need to be loaded just once at the beginning of the routine.
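To make the index vectors concrete, here is a minimal sketch in portable C (not NEON code) that builds the 24 indices for d6..d8 from a permutation and models what the three VTBL instructions do. The names build_vtbl_indices, vtbl3_model and permute_rgb8, and the perm[] convention (perm[c] is the source channel for output channel c, with 0 = R, 1 = G, 2 = B), are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill the 24 index bytes (8 each for d6, d7, d8).
 * perm[c] names the source channel (0 = R, 1 = G, 2 = B) that output
 * channel c should receive. */
static void build_vtbl_indices(const uint8_t perm[3], uint8_t idx[24])
{
    assert(perm[0] < 3 && perm[1] < 3 && perm[2] < 3);
    for (size_t i = 0; i < 24; ++i) {
        size_t pixel = i / 3;    /* which of the 8 pixels */
        size_t channel = i % 3;  /* output channel inside that pixel */
        idx[i] = (uint8_t)(pixel * 3 + perm[channel]);
    }
}

/* Scalar model of one VTBL.8 with a three-register (24-byte) table:
 * each output byte is table[idx], or 0 when idx is out of range. */
static void vtbl3_model(const uint8_t table[24], const uint8_t idx[8],
                        uint8_t out[8])
{
    for (size_t i = 0; i < 8; ++i)
        out[i] = (idx[i] < 24) ? table[idx[i]] : 0;
}

/* Permute 8 interleaved RGB pixels (24 bytes), mirroring the three
 * VTBL instructions that write d3, d4 and d5. */
static void permute_rgb8(const uint8_t src[24], const uint8_t perm[3],
                         uint8_t dst[24])
{
    uint8_t idx[24];
    build_vtbl_indices(perm, idx);
    vtbl3_model(src, idx,      dst);       /* d3 <- {d0,d1,d2}[d6] */
    vtbl3_model(src, idx + 8,  dst + 8);   /* d4 <- {d0,d1,d2}[d7] */
    vtbl3_model(src, idx + 16, dst + 16);  /* d5 <- {d0,d1,d2}[d8] */
}
```

With perm = {0, 2, 1} (swap G and B), the first nine indices come out as 0 2 1 3 5 4 6 8 7, matching the example above. In a real routine the 24-byte table for the selected permutation would be computed or looked up once and loaded into d6..d8 before the loop.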

Another possibility is the following sequence, which uses a de-interleaving load so that each channel ends up in its own register:

VLD3.8 { d0,d1,d2 }, [r0]    ; Read R, G, B into separate registers
VLD3.8 { d3,d4,d5 }, [r0]    ; Make a second copy (or use some other instruction)

VBIT d3, d1, d6              ; d3 is now either R or G
VBIT d4, d2, d7              ; d4 is now either G or B
VBIT d5, d0, d8              ; d5 is now either B or R

VBIT d0, d4, d9              ; d0 is now R or (G or B)
VBIT d1, d5, d10             ; d1 is now G or (B or R)
VBIT d2, d3, d11             ; d2 is now B or (R or G)

Even though the example uses 6 registers (d6..d11) for the select masks, 3 independent registers should be enough. One can also use VBIF where the reversed logic is needed.
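As a sanity check of the two-stage VBIT sequence, here is a rough scalar model in portable C (not NEON code), with each D register represented as a uint64_t. It keeps the six masks of the example rather than reducing them to three. The names vbit, swap_masks, masks_for and swap_planes are hypothetical, as is the perm[] convention (perm[c] is the source channel for output channel c, with 0 = R, 1 = G, 2 = B); each mask is either all ones or all zeros.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VBIT Dd, Dn, Dm: take bits from src where mask is 1,
 * keep the bits of dst elsewhere. */
static uint64_t vbit(uint64_t dst, uint64_t src, uint64_t mask)
{
    return (dst & ~mask) | (src & mask);
}

/* Hypothetical container for the six select masks d6..d11. */
typedef struct { uint64_t m6, m7, m8, m9, m10, m11; } swap_masks;

/* Derive the masks for one permutation of the three channels. */
static swap_masks masks_for(const uint8_t perm[3])
{
    const uint64_t ON = ~UINT64_C(0);
    swap_masks m = {0, 0, 0, 0, 0, 0};
    assert(perm[0] < 3 && perm[1] < 3 && perm[2] < 3);
    assert(perm[0] != perm[1] && perm[1] != perm[2] && perm[0] != perm[2]);
    /* Output 0: keep R, or take d4, which is G (m7 clear) or B (m7 set). */
    if (perm[0] != 0) { m.m9 = ON;  m.m7 = (perm[0] == 2) ? ON : 0; }
    /* Output 1: keep G, or take d5, which is B (m8 clear) or R (m8 set). */
    if (perm[1] != 1) { m.m10 = ON; m.m8 = (perm[1] == 0) ? ON : 0; }
    /* Output 2: keep B, or take d3, which is R (m6 clear) or G (m6 set). */
    if (perm[2] != 2) { m.m11 = ON; m.m6 = (perm[2] == 1) ? ON : 0; }
    return m;
}

/* Model of the full sequence; rgb[0..2] stand for d0, d1, d2 after VLD3. */
static void swap_planes(uint64_t rgb[3], const swap_masks *m)
{
    uint64_t d0 = rgb[0], d1 = rgb[1], d2 = rgb[2];
    uint64_t d3 = d0, d4 = d1, d5 = d2;   /* the second copy */

    d3 = vbit(d3, d1, m->m6);    /* R or G */
    d4 = vbit(d4, d2, m->m7);    /* G or B */
    d5 = vbit(d5, d0, m->m8);    /* B or R */

    d0 = vbit(d0, d4, m->m9);    /* R or (G or B) */
    d1 = vbit(d1, d5, m->m10);   /* G or (B or R) */
    d2 = vbit(d2, d3, m->m11);   /* B or (R or G) */

    rgb[0] = d0; rgb[1] = d1; rgb[2] = d2;
}
```

Calling masks_for once per configuration and swap_planes per group of pixels mirrors loading the mask registers before the loop. In this model all six permutations are reachable with the six masks.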

Licensed under: CC-BY-SA with attribution