Using XMM0 register and memory fetches (C++ code) is twice as fast as ASM only using XMM registers - Why?

Question 1

Once the data is in the cache (which it will be the case after the first loop, if it's not there already), it makes little difference if you use memory or register.
A floating point add will take a little longer than single cycle in the first place.
The final store to resVal "unties" the xmm0 register to allow the register to be freely "renamed", which allows more of the loops to be run in parallel.

This is a typical case of "unless you are absolutely sure, leave writing code to the compiler".

The last bullet above explains why the code is faster than code where every step of the loop depends on a previously calculated result.

In the compiler generated code, the loop can do the equivalent of:

movsd       xmm0,mmword ptr [val1]  
addsd       xmm0,mmword ptr [val2]  
addsd       xmm0,mmword ptr [val3]  
addsd       xmm0,mmword ptr [val4]  
addsd       xmm0,mmword ptr [val5]  
addsd       xmm0,mmword ptr [val6]  
addsd       xmm0,mmword ptr [val7]  
addsd       xmm0,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm0  

movsd       xmm1,mmword ptr [val1]  
addsd       xmm1,mmword ptr [val2]  
addsd       xmm1,mmword ptr [val3]  
addsd       xmm1,mmword ptr [val4]  
addsd       xmm1,mmword ptr [val5]  
addsd       xmm1,mmword ptr [val6]  
addsd       xmm1,mmword ptr [val7]  
addsd       xmm1,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm1

Now, as you can see, we could "mingle" these two "threads":

movsd       xmm0,mmword ptr [val1]  
movsd       xmm1,mmword ptr [val1]  
addsd       xmm0,mmword ptr [val2]  
addsd       xmm1,mmword ptr [val2]  
addsd       xmm0,mmword ptr [val3]  
addsd       xmm1,mmword ptr [val3]  
addsd       xmm0,mmword ptr [val4]  
addsd       xmm1,mmword ptr [val4]  
addsd       xmm0,mmword ptr [val5]  
addsd       xmm1,mmword ptr [val5]  
addsd       xmm0,mmword ptr [val6]  
addsd       xmm1,mmword ptr [val6]  
addsd       xmm0,mmword ptr [val7]  
addsd       xmm1,mmword ptr [val7]  
addsd       xmm0,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm0  
// Here we have to wait for resval to be uppdated!
addsd       xmm1,mmword ptr [resVal]  
movsd       mmword ptr [resVal],xmm1

I'm not suggesting it is quite that much out of order execution, but I can certainly see how the loop can be executed faster that your loop. You can probably achieve the same thing in your assembler code if you had a spare register [in x86_64 you do have another 8 registers, although you can't use inline assembler in x86_64...]

(Note that register renaming is different from my "threaded" loop, which is using two different registers - but the effect is roughly the same, the loop can continue after it hits the "resVal" update without having to wait for the result to be updated)

Question 2

May be it's useful for you not use _asm, but intrinsics functions and instrinsic types like __m128i of __m128d witch are represent sse registers. See immintrin.h it's define types and many sse functions. you can find good description and spec for them here :http://software.intel.com/sites/landingpage/IntrinsicsGuide/