Question

I have 128 32-bit values (numbered from 0 to 127) and have them ordered in the following way:

0, 32, 64, 96, 1, 33, 65, 97, ... 31, 63, 95, 127

With this ordering I load them in groups of 4 into NEON registers to perform some computation (which requires such ordering). Thus I have q0 = (0, 32, 64, 96) ... and so on.

I was wondering if there's some easy way of storing them back to memory in their natural order (0, 1, 2, 3, ...).

In other words, is there some easier way or a trick to do this:

vst1.u32 {d0[0]}, [r0]
vst1.u32 {d0[1]}, [r0,#128]
vst1.u32 {d1[0]}, [r0,#256]
vst1.u32 {d1[1]}, [r0,#384]
vst1.u32 {d2[0]}, [r0,#4]
vst1.u32 {d2[1]}, [r0,#132]
...

I don't quite understand the use of the @alignment suffix with the vstX and vldX instructions. Isn't this a case where it could be useful?

Solution

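The alignment qualifier (@64, @128, @256; the value is in bits) tells the processor that the address is aligned to that boundary, so the read/write cannot straddle it. In those cases the number of executed micro-operations (and hence the cycles used) can be reduced. If the address is not aligned as promised, the access raises an alignment fault. The qualifier does not change where the data ends up, so it does not help with reordering.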
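To make the units concrete, here is a minimal C sketch of what "aligned to 128 or 256 bits" means for a buffer address. The helper names `is_aligned_bits` and `alloc_neon_buffer` are made up for illustration, and `aligned_alloc` assumes a C11 library; this only checks addresses and does not issue any NEON instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p is aligned to `bits` bits (64, 128 or 256 for the @ qualifier). */
static int is_aligned_bits(const void *p, unsigned bits)
{
    assert(bits != 0 && bits % 8 == 0);
    uintptr_t bytes = bits / 8;              /* @128 means a 16-byte boundary */
    return ((uintptr_t)p % bytes) == 0;
}

/* Hypothetical helper: a 128-word buffer that would satisfy @256 (32 bytes). */
static uint32_t *alloc_neon_buffer(void)
{
    /* aligned_alloc wants size to be a multiple of the alignment: 512 % 32 == 0 */
    return aligned_alloc(32, 128 * sizeof(uint32_t));
}
```

A buffer obtained this way could be addressed as [r0@256] at its start; an address 4 bytes into it would not even qualify for @64.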

Your application would instead benefit from the instruction format that stores / loads a column across several registers. The ARM manual calls these lanes; see the subsection "Store a single lane of N-element structure to memory".

The format supports both consecutive registers, {d0,d1,d2,...}, and every second register, {d0,d2,d4,...}. Since q0 is the pair d0:d1, stepping by two picks the same half of four consecutive Q registers.

 mov r1, #128  // post-increment: 32 words = 128 bytes
 vst4.32 { d0[0], d2[0], d4[0], d6[0] }, [r0], r1   // values 0..3 at offset 0
 vst4.32 { d0[1], d2[1], d4[1], d6[1] }, [r0], r1   // values 32..35 at offset 128
 vst4.32 { d1[0], d3[0], d5[0], d7[0] }, [r0], r1   // values 64..67 at offset 256
 vst4.32 { d1[1], d3[1], d5[1], d7[1] }, [r0], r1   // values 96..99 at offset 384
 // r0 is now base + 512; rewind it to base + 16 before storing q4..q7
 ... etc ...
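
As a sanity check of the store pattern, here is a scalar C model of it; a sketch, not NEON code. It assumes each Q register holds four words with lanes 0..3 corresponding to d[0], d[1] of the low half and d[0], d[1] of the high half, and the names `store_lane4` and `deinterleave` are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 128, GROUPS = 32, LANES = 4 };

/* Models one single-lane vst4.32: writes lane `lane` of four Q registers
   to four consecutive words at dst. */
static void store_lane4(uint32_t *dst, uint32_t q[4][LANES], int lane)
{
    assert(lane >= 0 && lane < LANES);
    for (int r = 0; r < 4; ++r)
        dst[r] = q[r][lane];
}

/* in:  interleaved order 0, 32, 64, 96, 1, 33, ...
   out: natural order 0, 1, 2, ... */
static void deinterleave(const uint32_t in[N], uint32_t out[N])
{
    for (int g = 0; g < GROUPS; g += 4) {      /* four Q registers per pass */
        uint32_t q[4][LANES];
        for (int r = 0; r < 4; ++r)            /* load q[r] = 4 consecutive words */
            for (int l = 0; l < LANES; ++l)
                q[r][l] = in[(g + r) * LANES + l];
        for (int l = 0; l < LANES; ++l)        /* lane stores, 32 words (128 bytes) apart */
            store_lane4(out + l * 32 + g, q, l);
    }
}
```

Each pass of the outer loop corresponds to four lane stores like the ones above, with the destination starting 4 words (16 bytes) further on than in the previous pass.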

This is probably the best one can do, as there are not enough registers to shuffle everything in place. I believe one would need 32 Q registers to hold all 128 values, and NEON has only 16.

Licensed under: CC-BY-SA with attribution