- Once the data is in the cache (which will be the case after the first loop iteration, if it's not there already), it makes little difference whether you use memory or a register.
- A floating point add will take a little longer than a single cycle in the first place.
- The final store to resVal "unties" the xmm0 register to allow the register to be freely "renamed", which allows more of the loops to be run in parallel.
This is a typical case of "unless you are absolutely sure, leave writing code to the compiler".
The last bullet above explains why the compiler's code is faster than code where every step of the loop depends on a previously calculated result.
In the compiler-generated code, two consecutive iterations of the loop can do the equivalent of:
movsd xmm0,mmword ptr [val1]
addsd xmm0,mmword ptr [val2]
addsd xmm0,mmword ptr [val3]
addsd xmm0,mmword ptr [val4]
addsd xmm0,mmword ptr [val5]
addsd xmm0,mmword ptr [val6]
addsd xmm0,mmword ptr [val7]
addsd xmm0,mmword ptr [resVal]
movsd mmword ptr [resVal],xmm0
movsd xmm1,mmword ptr [val1]
addsd xmm1,mmword ptr [val2]
addsd xmm1,mmword ptr [val3]
addsd xmm1,mmword ptr [val4]
addsd xmm1,mmword ptr [val5]
addsd xmm1,mmword ptr [val6]
addsd xmm1,mmword ptr [val7]
addsd xmm1,mmword ptr [resVal]
movsd mmword ptr [resVal],xmm1
Now, as you can see, we could "mingle" these two "threads":
movsd xmm0,mmword ptr [val1]
movsd xmm1,mmword ptr [val1]
addsd xmm0,mmword ptr [val2]
addsd xmm1,mmword ptr [val2]
addsd xmm0,mmword ptr [val3]
addsd xmm1,mmword ptr [val3]
addsd xmm0,mmword ptr [val4]
addsd xmm1,mmword ptr [val4]
addsd xmm0,mmword ptr [val5]
addsd xmm1,mmword ptr [val5]
addsd xmm0,mmword ptr [val6]
addsd xmm1,mmword ptr [val6]
addsd xmm0,mmword ptr [val7]
addsd xmm1,mmword ptr [val7]
addsd xmm0,mmword ptr [resVal]
movsd mmword ptr [resVal],xmm0
// Here we have to wait for resVal to be updated!
addsd xmm1,mmword ptr [resVal]
movsd mmword ptr [resVal],xmm1
I'm not suggesting the processor executes quite that far out of order, but I can certainly see how the loop can be executed faster than your loop. You can probably achieve the same thing in your assembler code if you have a spare register [in x86_64 you do have another 8 registers, although you can't use inline assembler in x86_64...]
(Note that register renaming is different from my "threaded" loop, which uses two different registers, but the effect is roughly the same: the loop can continue after it hits the resVal update without having to wait for the result to be updated.)
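To make the dependency argument a bit more concrete, here is a rough C++ sketch of the two loop shapes. It is not the original code: the function names, the Values type and the iteration count are all made up for illustration, and the seven array entries simply stand in for val1..val7. The first function feeds every add through the running total, so each add has to wait for the one before it. The second sums the inputs into a fresh temporary and touches the running total only once per iteration, which is roughly the shape of the compiler-generated loop above.

```cpp
#include <array>
#include <cstddef>

// Hypothetical names throughout; the seven inputs stand in for val1..val7.
using Values = std::array<double, 7>;

// Every add goes through the running total, so each one has to wait
// for the add before it (one long dependency chain across iterations).
double accumulate_chained(const Values& vals, std::size_t iterations) {
    double res = 0.0;
    for (std::size_t i = 0; i < iterations; ++i) {
        for (double v : vals) {
            res += v;
        }
    }
    return res;
}

// The seven inputs are summed into a fresh temporary first; only the
// final add touches the running total, so the temporary sums of
// neighbouring iterations do not depend on each other.
double accumulate_partial(const Values& vals, std::size_t iterations) {
    double res = 0.0;
    for (std::size_t i = 0; i < iterations; ++i) {
        double tmp = vals[0];
        for (std::size_t k = 1; k < vals.size(); ++k) {
            tmp += vals[k];
        }
        res += tmp;
    }
    return res;
}
```

Treat this as a picture of the dependency structure rather than a benchmark: an optimizing compiler is free to hoist or restructure either loop. Also note that floating point addition is not associative, so for arbitrary inputs the two functions may differ in the last few bits; with small whole-number inputs they give identical results.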