Question

I'm working on an image manipulation algorithm, and I'm looking into optimizing it using NEON. The algorithm consists of multiplying each (RGBA, 8bit) pixel by some weight, doing some addition and finally converting back to uint8_t values. The first problem I have is how to efficiently load and convert a single uint8_t pixel into NEON's float32x4_t. I've searched the reference for a suitable conversion, and couldn't find one that fits, so I resorted to this ugly code:

const uint8_t* psrc = ...; // pointer to image data
float rgba[4];
for (int c = 0; c < 4; ++c) {
  rgba[c] = *psrc++;
}
float32x4_t srcpix = vld1q_f32(rgba);

Can anyone suggest a 'cleaner' way to do this?

EDIT: so I came up with this, still feels cumbersome:

uint8x8_t srcu8 = vld1_u8(psrc);               // load 8 bytes (2 RGBA pixels)
uint16x8_t srcu16x8 = vmovl_u8(srcu8);         // widen u8 -> u16
uint16x4_t srcu16x4 = vget_low_u16(srcu16x8);  // keep the first pixel
uint32x4_t srcu32x4 = vmovl_u16(srcu16x4);     // widen u16 -> u32
srcpix = vcvtq_f32_u32(srcu32x4);              // convert u32 -> f32

Solution 3

As far as I know, NEON supports only 32-bit conversions: with the vcvt_...() intrinsics you can convert between, e.g., float32x4_t and uint32x4_t. So you will need to widen your uint8x8_t into a uint32x4x2_t and then apply vcvt to both halves.

EDIT: Unfortunately, I cannot provide code, as I have not worked with NEON in a long time and cannot remember the exact intrinsics.
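For reference, the widening chain described above would look roughly like this with intrinsics. This is a hedged sketch, not code from the thread: it requires an ARM target with `<arm_neon.h>`, the function name is mine, and it assumes `psrc` points at 8 valid bytes (two RGBA pixels):

```c
#include <arm_neon.h>

/* Sketch: widen u8 -> u16 -> u32, then convert both halves to float.
   Assumes psrc points at 8 valid bytes (two RGBA pixels). */
void load_two_pixels_f32(const uint8_t *psrc,
                         float32x4_t *pix0, float32x4_t *pix1)
{
    uint8x8_t  u8  = vld1_u8(psrc);                 /* load 8 bytes        */
    uint16x8_t u16 = vmovl_u8(u8);                  /* widen to u16        */
    uint32x4_t lo  = vmovl_u16(vget_low_u16(u16));  /* first pixel as u32  */
    uint32x4_t hi  = vmovl_u16(vget_high_u16(u16)); /* second pixel as u32 */
    *pix0 = vcvtq_f32_u32(lo);
    *pix1 = vcvtq_f32_u32(hi);
}
```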

OTHER TIPS

So you want to convert them to float for some arithmetic and convert the results back to int? That's exactly the opposite of what people call optimization.

Stick with fixed point arithmetic where NEON truly shines.
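The idea, shown in plain portable C (the function name and the Q8 format are illustrative, not from this thread): represent a fractional weight as an integer scaled by 256, multiply, then shift back with rounding. NEON's vmull_u8 / vrshrn_n_u16 perform exactly these steps on 8 channels per instruction.

```c
#include <stdint.h>

/* Q8 fixed point: weight_q8 = weight * 256, so 192 represents 0.75.
   Multiply, add half (128) to round, then shift back down by 8. */
static uint8_t scale_q8(uint8_t channel, uint16_t weight_q8)
{
    uint32_t product = (uint32_t)channel * weight_q8; /* fits in 24 bits */
    return (uint8_t)((product + 128) >> 8);           /* round, back to u8 */
}
```

For example, scale_q8(200, 192) computes 200 * 0.75 = 150 without any floating point.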

I can hardly imagine a case where converting to float makes sense for an ARGB format where each channel is only 8 bits in size (and in precision).

Apparently you are trying to let NEON handle only the conversion back and forth while the float arithmetic is done by the ARM core, but that is exactly the wrong way to utilize NEON.

A properly NEON-optimized function should let NEON handle the data load, the arithmetic, and the data store all by itself. Done right, I'm sure the NEON version will run over 20 times faster than your current one, at near-memcpy speed. NEON is THAT powerful with fixed-point arithmetic.
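To illustrate, here is a hedged sketch of such an all-NEON fixed-point pipeline (the function name and Q8 weight format are mine, not the OP's; requires an ARM target): it loads 4 RGBA pixels, multiplies every channel by a Q8 weight with a widening multiply, narrows back with rounding, and stores, with no float anywhere.

```c
#include <arm_neon.h>

/* Sketch: scale 4 RGBA pixels (16 bytes) by weight_q8 (weight * 256,
   so weights up to 255/256). Load, multiply, round/narrow, store --
   the whole pipeline stays inside NEON. */
void scale_pixels_q8(uint8_t *dst, const uint8_t *src, uint8_t weight_q8)
{
    uint8x8_t  w  = vdup_n_u8(weight_q8);          /* broadcast weight      */
    uint8x16_t px = vld1q_u8(src);                 /* 4 RGBA pixels at once */
    uint16x8_t lo = vmull_u8(vget_low_u8(px), w);  /* widening u8*u8 -> u16 */
    uint16x8_t hi = vmull_u8(vget_high_u8(px), w);
    uint8x8_t rlo = vrshrn_n_u16(lo, 8);           /* rounding narrow >> 8  */
    uint8x8_t rhi = vrshrn_n_u16(hi, 8);
    vst1q_u8(dst, vcombine_u8(rlo, rhi));          /* store 16 bytes        */
}
```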

Please share more information about what you are trying to do. Maybe I can help.

The VTBL table lookup instruction can do an unsigned 8-bit to 32-bit extension in a single operation, because out-of-range indices produce zero. Unfortunately the output is a single 64-bit NEON register (a uint32x2_t here), so to fill a uint32x4_t you need to invoke it twice. For all eight bytes of a uint8x8_t source, you'd have to do:

uint8x8_t bvec = vld1_u8(psrc);

/* Out-of-range indices (0xFF) yield zero with VTBL, giving a free
   zero-extension of each source byte into its own 32-bit lane. */
const uint8x8_t idx0 = { 0, 0xFF, 0xFF, 0xFF, 1, 0xFF, 0xFF, 0xFF };
const uint8x8_t idx1 = { 2, 0xFF, 0xFF, 0xFF, 3, 0xFF, 0xFF, 0xFF };
const uint8x8_t idx2 = { 4, 0xFF, 0xFF, 0xFF, 5, 0xFF, 0xFF, 0xFF };
const uint8x8_t idx3 = { 6, 0xFF, 0xFF, 0xFF, 7, 0xFF, 0xFF, 0xFF };

uint32x4_t ivec[2] = {
    vcombine_u32(vreinterpret_u32_u8(vtbl1_u8(bvec, idx0)),
                 vreinterpret_u32_u8(vtbl1_u8(bvec, idx1))),
    vcombine_u32(vreinterpret_u32_u8(vtbl1_u8(bvec, idx2)),
                 vreinterpret_u32_u8(vtbl1_u8(bvec, idx3)))
};

float32x4_t vec[2] = { vcvtq_f32_u32(ivec[0]), vcvtq_f32_u32(ivec[1]) };

I don't think it's fewer instructions than the method you found. The lookup tables have to be loaded from memory as well, so it might be slower. Then there's also the need for vreinterpret and vcombine... those are free operations, but they look cruddy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow