As ScottD pointed out, the answer to the question lies in the generated assembly code. Apparently the Intel compiler is smart enough to detect the access pattern and automatically generates non-temporal stores (vmovntps) even for the temporal version.
Here is the compiler-generated assembly code for the temporal version:
..___tag_value___Z13copy_temporalPfS_.35: #
xor edx, edx #22.4
xor eax, eax #
..B2.2: # Preds ..B2.2 ..B2.1
vmovups xmm0, XMMWORD PTR [rax+rdi] #23.34
inc rdx #22.4
vmovntps XMMWORD PTR [rax+rsi], xmm0 #23.20
vmovups xmm1, XMMWORD PTR [16+rax+rdi] #24.36
vmovntps XMMWORD PTR [16+rax+rsi], xmm1 #24.20
vmovups xmm2, XMMWORD PTR [32+rax+rdi] #23.34
vmovntps XMMWORD PTR [32+rax+rsi], xmm2 #23.20
vmovups xmm3, XMMWORD PTR [48+rax+rdi] #24.36
vmovntps XMMWORD PTR [48+rax+rsi], xmm3 #24.20
add rax, 64 #22.4
cmp rdx, 5000000 #22.4
jb ..B2.2 # Prob 99% #22.4
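Each iteration of the loop above moves 64 bytes (four 16-byte chunks) using a regular unaligned load (vmovups) followed by a non-temporal store (vmovntps). As a rough illustration only, the same load/store pairing could be written by hand with SSE intrinsics as below. This is a sketch, not the original source: the function name copy_streaming_sketch and its parameters are hypothetical, it assumes the destination is 16-byte aligned and the element count is a multiple of 4, it is not unrolled, and the trailing fence is my addition and does not appear in the listing.

```cpp
#include <immintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

// Hypothetical hand-written counterpart of the compiler-generated loop.
// Assumptions: dst is 16-byte aligned, n is a multiple of 4.
void copy_streaming_sketch(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  // unaligned load, like vmovups
        _mm_stream_ps(dst + i, v);         // non-temporal store, like vmovntps
    }
    _mm_sfence();  // order the streaming stores before any later stores
}
```

Note that the load side is an ordinary load in both the listing and this sketch; only the stores carry the non-temporal hint.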
One question still remains:
Why does the non-aligned, temporal version perform better than the non-temporal version on the E5-2650 CPU (see above)? I have already looked at the generated assembly code, and the compiler really does generate vmovups instructions there (because the alignment is not guaranteed).