As @Michael pointed out, VQRSHL is the appropriate shift-by-register instruction here. Its shift count is signed, and fortunately a right shift is just a left shift by a negative amount. I'd use a VDUP first to broadcast the negated shift from r0 into a vector of shift values, and a VQMOVN afterwards for the narrowing. All of these are available as intrinsics, which helps keep the nastiness of inline assembly at bay. Something like this:
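/* needs #include <arm_neon.h>; data is a uint32x4_t, shift a plain int */
vshift = vdupq_n_s32(-shift);                     /* negative count = shift right */
result = vqmovn_u32(vqrshlq_u32(data, vshift));   /* rounding shift, then saturating narrow to uint16x4_t */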
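If you want to sanity-check the output, here is a rough scalar model of what I'd expect each lane to compute: a rounding right shift followed by unsigned saturation to 16 bits. The function name `shift_round_narrow_ref` is made up for illustration, and the sketch assumes the shift amount is in the range 0..31.

```c
#include <stdint.h>

/* Hypothetical scalar reference for one lane (not part of any NEON API).
   Assumes 0 <= shift <= 31. */
static uint16_t shift_round_narrow_ref(uint32_t x, unsigned shift)
{
    uint64_t wide = x;  /* widen so the rounding add cannot overflow */

    if (shift > 0) {
        /* add half of the discarded range, then shift right */
        wide = (wide + (1ull << (shift - 1))) >> shift;
    }

    /* saturate to the unsigned 16-bit range, as the narrowing step does */
    return wide > 0xFFFFu ? (uint16_t)0xFFFFu : (uint16_t)wide;
}
```

Running this over the same input words and comparing lane by lane against the intrinsic version should flag any sign mistake in the shift vector quickly.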