The VTBX table lookup instruction can zero-extend unsigned 8-bit values to 32 bits in a single operation. Unfortunately its output is a single 64-bit NEON register (a uint32x2_t), so filling a uint32x4_t takes two invocations. For all eight bytes of a uint8x8_t source, you'd have to do:
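uint8x8_t bvec = vld1_u8(psrc);
/* Index 0xFF (the -1 above) is out of range, so VTBX leaves that lane at the
   zero supplied as destination. Brace initialisation of vector types is
   GCC/Clang syntax. */
const uint8x8x4_t tbl = { {
    { 0, 0xFF, 0xFF, 0xFF, 1, 0xFF, 0xFF, 0xFF },
    { 2, 0xFF, 0xFF, 0xFF, 3, 0xFF, 0xFF, 0xFF },
    { 4, 0xFF, 0xFF, 0xFF, 5, 0xFF, 0xFF, 0xFF },
    { 6, 0xFF, 0xFF, 0xFF, 7, 0xFF, 0xFF, 0xFF }
} };
const uint8x8_t zero = vdup_n_u8(0);
/* vtbx1_u8(destination, table, indices): the source bytes are the table. */
uint32x4_t ivec[2] = {
    vcombine_u32(
        vreinterpret_u32_u8(vtbx1_u8(zero, bvec, tbl.val[0])),
        vreinterpret_u32_u8(vtbx1_u8(zero, bvec, tbl.val[1]))),
    vcombine_u32(
        vreinterpret_u32_u8(vtbx1_u8(zero, bvec, tbl.val[2])),
        vreinterpret_u32_u8(vtbx1_u8(zero, bvec, tbl.val[3])))
};
float32x4_t vec[2] = { vcvtq_f32_u32(ivec[0]), vcvtq_f32_u32(ivec[1]) };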
I don't think it's fewer instructions than the method you found. The lookup table has to be loaded from memory as well, so it might be slower. There's also the need for vreinterpret, which is a free operation but looks cruddy.
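To make the index pattern easier to reason about, here is a rough scalar model of the lookup in plain C. It is only a sketch of the lane semantics, not NEON code: the function names (model_tbl1, model_tbx1, model_widen_pair) are made up for illustration, and it assumes the little-endian lane layout where byte 0 of each 32-bit lane is the low byte.

```c
#include <assert.h>
#include <stdint.h>

/* VTBL-style lookup over one 8-byte table: an index >= 8 yields 0. */
void model_tbl1(uint8_t out[8], const uint8_t table[8], const uint8_t idx[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = (idx[i] < 8) ? table[idx[i]] : 0;
}

/* VTBX-style lookup: an index >= 8 leaves the destination lane untouched. */
void model_tbx1(uint8_t dest[8], const uint8_t table[8], const uint8_t idx[8]) {
    for (int i = 0; i < 8; ++i)
        if (idx[i] < 8)
            dest[i] = table[idx[i]];
}

/* Widen source bytes 2*pair and 2*pair+1 into two 32-bit values using the
   index pattern { n, 0xFF, 0xFF, 0xFF, n+1, 0xFF, 0xFF, 0xFF }. */
void model_widen_pair(uint32_t out[2], const uint8_t src[8], int pair) {
    assert(pair >= 0 && pair < 4);
    const uint8_t lo = (uint8_t)(2 * pair);
    const uint8_t hi = (uint8_t)(2 * pair + 1);
    const uint8_t idx[8] = { lo, 0xFF, 0xFF, 0xFF, hi, 0xFF, 0xFF, 0xFF };
    uint8_t bytes[8];
    model_tbl1(bytes, src, idx);
    /* Assemble each lane explicitly so the model does not depend on the
       byte order of the machine running it. */
    for (int k = 0; k < 2; ++k)
        out[k] = (uint32_t)bytes[4 * k]
               | ((uint32_t)bytes[4 * k + 1] << 8)
               | ((uint32_t)bytes[4 * k + 2] << 16)
               | ((uint32_t)bytes[4 * k + 3] << 24);
}
```

Calling model_widen_pair with pair 0 through 3 corresponds to the four lookups above. The model also shows why a VTBX-based version needs a zeroed destination: with any other destination contents, the out-of-range lanes would keep stale bytes instead of zeros.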