Question

I'm working on an image manipulation algorithm, and I'm looking into optimizing it using NEON. The algorithm consists of multiplying each (RGBA, 8bit) pixel by some weight, doing some addition and finally converting back to uint8_t values. The first problem I have is how to efficiently load and convert a single uint8_t pixel into NEON's float32x4_t. I've searched the reference for a suitable conversion, and couldn't find one that fits, so I resorted to this ugly code:

const uint8_t* psrc = ...; // pointer to image data
float rgba[4];
for (int c = 0; c < 4; ++c) {
  rgba[c] = *psrc++;
}
float32x4_t srcpix = vld1q_f32(rgba);

Can anyone suggest a 'cleaner' way to do this?

EDIT: so I came up with this, still feels cumbersome:

uint8x8_t srcu8 = vld1_u8(psrc);               // load 8 bytes (2 RGBA pixels)
uint16x8_t srcu16x8 = vmovl_u8(srcu8);         // widen u8 -> u16
uint16x4_t srcu16x4 = vget_low_u16(srcu16x8);  // keep the first pixel
uint32x4_t srcu32x4 = vmovl_u16(srcu16x4);     // widen u16 -> u32
srcpix = vcvtq_f32_u32(srcu32x4);              // convert u32 -> f32

Solution 3

As far as I know, NEON supports only 32-bit conversions: with the vcvt_...() intrinsics you can convert between, e.g., float32x4_t and uint32x4_t. So you will need to widen your uint8x8_t into a uint32x4x2_t and then apply vcvt to both halves.

EDIT: Unfortunately, I cannot provide code, as I have not worked with NEON in a long time and cannot remember the exact intrinsics.
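For reference, the widening chain described above would look roughly like this with intrinsics. This is a hedged sketch, not code from the thread: it requires an ARM target with `<arm_neon.h>`, the function name is mine, and it assumes `psrc` points at 8 valid bytes (two RGBA pixels):

```c
#include <arm_neon.h>

/* Sketch: widen u8 -> u16 -> u32, then convert both halves to float.
   Assumes psrc points at 8 valid bytes (two RGBA pixels). */
void load_two_pixels_f32(const uint8_t *psrc,
                         float32x4_t *pix0, float32x4_t *pix1)
{
    uint8x8_t  u8  = vld1_u8(psrc);                 /* load 8 bytes        */
    uint16x8_t u16 = vmovl_u8(u8);                  /* widen to u16        */
    uint32x4_t lo  = vmovl_u16(vget_low_u16(u16));  /* first pixel as u32  */
    uint32x4_t hi  = vmovl_u16(vget_high_u16(u16)); /* second pixel as u32 */
    *pix0 = vcvtq_f32_u32(lo);
    *pix1 = vcvtq_f32_u32(hi);
}
```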

OTHER TIPS

So you want to convert them to float for some arithmetic and convert the results back to int? That's exactly the opposite of what people call optimization.

Stick with fixed point arithmetic where NEON truly shines.
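The idea, shown in plain portable C (the function name and the Q8 format are illustrative, not from this thread): represent a fractional weight as an integer scaled by 256, multiply, then shift back with rounding. NEON's vmull_u8 / vrshrn_n_u16 perform exactly these steps on 8 channels per instruction.

```c
#include <stdint.h>

/* Q8 fixed point: weight_q8 = weight * 256, so 192 represents 0.75.
   Multiply, add half (128) to round, then shift back down by 8. */
static uint8_t scale_q8(uint8_t channel, uint16_t weight_q8)
{
    uint32_t product = (uint32_t)channel * weight_q8; /* fits in 24 bits */
    return (uint8_t)((product + 128) >> 8);           /* round, back to u8 */
}
```

For example, scale_q8(200, 192) computes 200 * 0.75 = 150 without any floating point.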

I can hardly imagine a case where converting to float makes sense for an ARGB format where each channel is only 8 bits in size (and in precision).

Apparently you are trying to let NEON handle only the conversion back and forth while the float arithmetic is done by the ARM core, but that is exactly the wrong way to utilize NEON.

A properly NEON-optimized function should let NEON handle the data load, the arithmetic, and the data store all by itself. Done right, I'm sure the NEON version will run over 20 times faster than your current one, at near-memcpy speed. NEON is THAT powerful with fixed-point arithmetic.
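To illustrate, here is a hedged sketch of such an all-NEON fixed-point pipeline (the function name and Q8 weight format are mine, not the OP's; requires an ARM target): it loads 4 RGBA pixels, multiplies every channel by a Q8 weight with a widening multiply, narrows back with rounding, and stores, with no float anywhere.

```c
#include <arm_neon.h>

/* Sketch: scale 4 RGBA pixels (16 bytes) by weight_q8 (weight * 256,
   so weights up to 255/256). Load, multiply, round/narrow, store --
   the whole pipeline stays inside NEON. */
void scale_pixels_q8(uint8_t *dst, const uint8_t *src, uint8_t weight_q8)
{
    uint8x8_t  w  = vdup_n_u8(weight_q8);          /* broadcast weight      */
    uint8x16_t px = vld1q_u8(src);                 /* 4 RGBA pixels at once */
    uint16x8_t lo = vmull_u8(vget_low_u8(px), w);  /* widening u8*u8 -> u16 */
    uint16x8_t hi = vmull_u8(vget_high_u8(px), w);
    uint8x8_t rlo = vrshrn_n_u16(lo, 8);           /* rounding narrow >> 8  */
    uint8x8_t rhi = vrshrn_n_u16(hi, 8);
    vst1q_u8(dst, vcombine_u8(rlo, rhi));          /* store 16 bytes        */
}
```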

Please share more information about what you are trying to do. Maybe I can help.

The VTBL table lookup instruction can do an unsigned 8-bit to 32-bit extension in a single operation, because out-of-range indices produce zero. Unfortunately the output is a single 64-bit NEON register (a uint32x2_t here), so to fill a uint32x4_t you need to invoke it twice. For all eight bytes of a uint8x8_t source, you'd have to do:

uint8x8_t bvec = vld1_u8(psrc);

/* Out-of-range indices (0xFF) yield zero with VTBL, giving a free
   zero-extension of each source byte into its own 32-bit lane. */
const uint8x8_t idx0 = { 0, 0xFF, 0xFF, 0xFF, 1, 0xFF, 0xFF, 0xFF };
const uint8x8_t idx1 = { 2, 0xFF, 0xFF, 0xFF, 3, 0xFF, 0xFF, 0xFF };
const uint8x8_t idx2 = { 4, 0xFF, 0xFF, 0xFF, 5, 0xFF, 0xFF, 0xFF };
const uint8x8_t idx3 = { 6, 0xFF, 0xFF, 0xFF, 7, 0xFF, 0xFF, 0xFF };

uint32x4_t ivec[2] = {
    vcombine_u32(vreinterpret_u32_u8(vtbl1_u8(bvec, idx0)),
                 vreinterpret_u32_u8(vtbl1_u8(bvec, idx1))),
    vcombine_u32(vreinterpret_u32_u8(vtbl1_u8(bvec, idx2)),
                 vreinterpret_u32_u8(vtbl1_u8(bvec, idx3)))
};

float32x4_t vec[2] = { vcvtq_f32_u32(ivec[0]), vcvtq_f32_u32(ivec[1]) };

I don't think it's fewer instructions than the method you found. The lookup tables have to be loaded from memory as well, so it might be slower. Then there's also the need for vreinterpret and vcombine... those are free operations, but they look cruddy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow