The alignment qualifier (@128, @256) tells the processor that the address is aligned to that many bits, so the read/write does not cross a cache line. In those cases the number of executed micro-operations (and ultimately the number of cycles used) can be reduced.
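As a rough illustration of why an aligned access stays inside one cache line, here is a minimal C sketch of the address arithmetic. The helper names `crosses_cache_line` and `is_aligned_bits` are made up for this example, and the cache line size is passed in as a parameter because it depends on the core (64 bytes is only an assumption in the usage note).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if an access of `size` bytes starting at `addr`
   touches more than one cache line. `line` is the cache line size in bytes
   and must be a power of two. */
static bool crosses_cache_line(uintptr_t addr, size_t size, size_t line)
{
    assert(size > 0 && line > 0 && (line & (line - 1)) == 0);
    uintptr_t first = addr / line;              /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) / line; /* line holding the last byte  */
    return first != last;
}

/* Hypothetical helper: true if `addr` satisfies an alignment qualifier given
   in bits, e.g. 128 for @128 (16 bytes) or 256 for @256 (32 bytes). */
static bool is_aligned_bits(uintptr_t addr, unsigned bits)
{
    assert(bits >= 8 && bits % 8 == 0);
    return (addr % (bits / 8)) == 0;
}
```

Assuming 64-byte lines, a 16-byte access at an @128-aligned address such as 0x1030 stays within one line, while the same access at 0x1038 straddles two.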
Your application would instead benefit from the instruction format that allows storing / loading columns. The Arm manual calls these lanes; see the subsection "Store a single lane of N-element structure to memory".
The register list can use either consecutive registers, {d0,d1,d2,...}, or every second register, {d0,d2,d4,...}.
mov r1, #128 // post-increment stride in bytes
vst4.32 { d0[0], d2[0], d4[0], d6[0] }, [r0], r1 // column 0 (lane 0 of d0, d2, d4, d6)
vst4.32 { d0[1], d2[1], d4[1], d6[1] }, [r0], r1 // column 1, stored at offset 128
vst4.32 { d1[0], d3[0], d5[0], d7[0] }, [r0], r1 // column 2, stored at offset 256
vst4.32 { d1[1], d3[1], d5[1], d7[1] }, [r0], r1 // column 3, stored at offset 384
... etc ...
This is probably the best one can do, as there are not enough registers to shuffle everything in place. I believe one would need 32 Q-registers.
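To make the effect of the four lane stores concrete, here is a scalar C model of that sequence. It is only a sketch: it assumes q0..q3 (that is, d0..d7) hold the four rows of a 4x4 block of 32-bit values, and the function name `store_columns_model` is invented for this example. It is not NEON code, it just mimics the memory layout the stores produce.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of four single-lane vst4.32 stores with register post-increment.
   q[4*row + col] is lane `col` of Q register `row` (assumed layout: q0..q3 are
   the rows of a 4x4 block). Each iteration gathers one column, writes it as
   four consecutive 32-bit words, then advances dst by `stride` bytes, like
   "vst4.32 {...}, [r0], r1". */
static void store_columns_model(uint8_t *dst, const uint32_t q[16], size_t stride)
{
    assert(dst != NULL && q != NULL);
    for (unsigned col = 0; col < 4; ++col) {
        uint32_t group[4];
        for (unsigned row = 0; row < 4; ++row)
            group[row] = q[4 * row + col];  /* same lane from each register */
        memcpy(dst, group, sizeof group);   /* 16 bytes, stored contiguously */
        dst += stride;                      /* post-increment by r1 */
    }
}
```

With a stride of 128, column 1 of the block ends up at byte offset 128 from the original pointer, column 2 at 256, and column 3 at 384.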