Well, there are basically two options here. Option 1:
add ecx, 16
movaps XMMWORD PTR [ecx-16], xmm1 ; stall for ecx?
cmp ecx, edx
jb loop
or option 2:
movaps XMMWORD PTR [ecx], xmm1
add ecx, 16
cmp ecx, edx ; stall for ecx?
jb loop
In option 1 you have a potential stall between add and movaps. In option 2 you have a potential stall between add and cmp. However, there is also the issue of the execution unit used. add and cmp (= sub) use the ALU, while the [ecx-16] uses the AGU (Address Generation Unit), I believe. So I suspect there might be a slight win in option 1 because ALU use is interleaved with AGU use.
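For anyone who wants to try both orderings from C rather than hand-written assembly, here is a minimal sketch using SSE intrinsics. The function names are made up for illustration, and it assumes an x86 target, a 16-byte-aligned buffer, and a length that is a multiple of four floats. Note that an optimizing compiler is free to reschedule the add and the store, so the source order may not survive into the generated code; check the disassembly before drawing conclusions from timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_store_ps (an aligned store, i.e. movaps) */

/* Option 1: advance the pointer first, then store 16 bytes behind it. */
void fill_add_then_store(float *begin, float *end, __m128 v)
{
    /* Assumptions: aligned, non-empty, whole number of 16-byte chunks. */
    assert(((uintptr_t)begin & 15) == 0);
    assert(begin < end && (end - begin) % 4 == 0);

    float *p = begin;
    do {
        p += 4;                  /* add ecx, 16 */
        _mm_store_ps(p - 4, v);  /* movaps [ecx-16], xmm1 */
    } while (p < end);           /* cmp ecx, edx / jb loop */
}

/* Option 2: store at the current pointer, then advance it. */
void fill_store_then_add(float *begin, float *end, __m128 v)
{
    /* Same assumptions as above. */
    assert(((uintptr_t)begin & 15) == 0);
    assert(begin < end && (end - begin) % 4 == 0);

    float *p = begin;
    do {
        _mm_store_ps(p, v);      /* movaps [ecx], xmm1 */
        p += 4;                  /* add ecx, 16 */
    } while (p < end);           /* cmp ecx, edx / jb loop */
}
```

Both functions write exactly the same data; they differ only in where the pointer increment sits relative to the store, which is the thing being compared above. Whether one is measurably faster depends on the specific CPU, so timing both on the target machine is the only reliable answer.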