Optimal NEON vector structure for processing vectors of uint8_t type with Arm Cortex-A8 (32-bit)

https://stackoverflow.com/questions/19419678

01-07-2022

Question

I am doing some image processing on an embedded system (BeagleBone Black) using OpenCV and need to write some code to take advantage of NEON optimization. Specifically, I would like to write a NEON optimized thresholding function and then a NEON optimized erosion/dilation function.

This is my first time writing NEON code and I don't have experience writing assembly code, so I have been looking at examples and resources for the C-style NEON intrinsics. I believe that I can put some working code together, but am not sure how I should structure the vectors. According to page 2 of the "ARM NEON support in the ARM compiler" white paper:

"These registers can hold "vectors" of items which are 8, 16, 32 or 64 bits. The traditional advice when optimizing or porting algorithms written in C/C++ is to use the natural type of the machine for data handling (in the case of ARM 32 bits). The unwanted bits can then be discarded by casting and/or shifting before storing to memory."

What exactly does this mean? Do I need to restrict my NEON code to using uint32x4_t vectors rather than uint8x16_t? How would I go about loading the registers? Or does this mean that I need to take some special steps when using vst1q_u8 to store the data to memory?

I did find this example, which is untested but uses the uint8x16_t type. Does it adhere to the "32-bit" advice given above?

I would really appreciate it if someone could please elaborate on the above quotation and maybe provide a very simple working example.

Solution

The next sentence from the document you linked gives your answer.

"The ability of NEON to specify the data width in the instruction and hence use the whole register width for useful information means keeping the natural type for the algorithm is both possible and preferable."

Note, the document is distinguishing between the natural type of the machine (32-bit) and the natural type of the algorithm (in your case uint8_t).

The document is saying that in the past you would have written your code in such a way that it used 32-bit integers so that it could use the efficient machine instructions suited for 32-bit operations.

With Neon, this is not necessary. It is more useful to use the data type you actually want to use, as Neon can efficiently operate on those data types.
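Since you mentioned thresholding, here is a minimal sketch of what working directly on uint8_t lanes could look like. The function name threshold_u8 and its signature are my own invention for illustration, not something from OpenCV or the white paper, and the "255 if above the threshold, else 0" behaviour is an assumption about what you want. The scalar loop handles the leftover bytes and also lets the code build when NEON is not enabled.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = 255 if src[i] > thresh, else 0. */
void threshold_u8(const uint8_t *src, uint8_t *dst, size_t n, uint8_t thresh)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Broadcast the threshold into all 16 byte lanes. */
    const uint8x16_t vthresh = vdupq_n_u8(thresh);
    for (; i + 16 <= n; i += 16) {
        uint8x16_t pixels = vld1q_u8(src + i);
        /* Each lane becomes 0xFF where pixel > thresh, 0x00 otherwise. */
        uint8x16_t mask = vcgtq_u8(pixels, vthresh);
        vst1q_u8(dst + i, mask);
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < n; ++i)
        dst[i] = (src[i] > thresh) ? 255 : 0;
}
```

Note that nothing here is widened to 32 bits: the comparison result is already a full byte mask, so it can be stored as the output pixels directly.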

The optimal choice of register width (uint8x8_t, which fills a 64-bit D register, or uint8x16_t, which fills a 128-bit Q register) will depend on your algorithm.
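One case where the 64-bit type tends to be the natural fit is when the result is wider than the input. As a rough sketch (add_widen_u8 is a made-up name, and the scalar loop is only there for the leftover elements and for non-NEON builds), a widening add takes two uint8x8_t inputs and produces one uint16x8_t, which fills a whole 128-bit register:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum[i] = a[i] + b[i], kept in 16 bits so it cannot overflow. */
void add_widen_u8(const uint8_t *a, const uint8_t *b, uint16_t *sum, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n; i += 8) {
        uint8x8_t va = vld1_u8(a + i);     /* 8 bytes in a 64-bit register */
        uint8x8_t vb = vld1_u8(b + i);
        uint16x8_t vs = vaddl_u8(va, vb);  /* widening add: 8 x 16-bit results */
        vst1q_u16(sum + i, vs);
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < n; ++i)
        sum[i] = (uint16_t)(a[i] + b[i]);
}
```

If the output stays 8 bits wide, as in a threshold or a min/max filter, uint8x16_t usually lets you handle twice as many pixels per instruction.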

To give a simple example of using the Neon intrinsics to add two sets of uint8_t:

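#include <arm_neon.h>

/* Adds 16 uint8_t values from a and b lane by lane and stores them in c.
   Each pointer must reference at least 16 bytes.  */
void
foo (const uint8_t *a, const uint8_t *b, uint8_t *c)
{
  uint8x16_t t1 = vld1q_u8 (a);       /* load 16 bytes from a */
  uint8x16_t t2 = vld1q_u8 (b);       /* load 16 bytes from b */
  uint8x16_t t3 = vaddq_u8 (t1, t2);  /* lane-wise add, wraps modulo 256 */
  vst1q_u8 (c, t3);                   /* store the 16 results to c */
}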
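
The function above handles exactly 16 bytes. For a whole image row you would normally loop over it in 16-byte steps and finish the remainder with scalar code. A possible sketch follows; add_u8_array is a hypothetical name, and I am assuming the buffers do not overlap:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: c[i] = (a[i] + b[i]) modulo 256 for i in [0, n). */
void add_u8_array(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 16 <= n; i += 16) {
        uint8x16_t t1 = vld1q_u8(a + i);
        uint8x16_t t2 = vld1q_u8(b + i);
        vst1q_u8(c + i, vaddq_u8(t1, t2));  /* wraps on overflow */
    }
#endif
    /* Scalar tail for the last n % 16 bytes (or everything without NEON). */
    for (; i < n; ++i)
        c[i] = (uint8_t)(a[i] + b[i]);
}
```

For pixel data you often want the sum clamped at 255 rather than wrapped. In that case vqaddq_u8 is the saturating counterpart of vaddq_u8, and the scalar tail would need to clamp as well.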

Licensed under: CC-BY-SA with attribution
