Here are a few things:

- Use __m128i instead of __m128.
- You can zero-initialize vsum using __m128i vsum = _mm_setzero_si128();
- For the data load, cast the pointer to the proper __m128i* type and use the packed-load version (_mm_loadl_epi64 only loads one 64-bit integer). So, either

    for (int i = 0; i < n; i += 2) {  // 2 uint64 in a single __m128i
        __m128i v = _mm_loadu_si128(reinterpret_cast<__m128i*>(&a[i]));
        // ... rest of the loop body ...
    }

or

    __m128i* pa = reinterpret_cast<__m128i*>(a);
    for (int i = 0; i < n; i += 2) {  // 2 uint64 in a single __m128i
        __m128i v = _mm_loadu_si128(pa);
        pa++;
        // ... rest of the loop body ...
    }

Finally, you may be able to assign to sum using

    sum = vsum.m128i_u64[0] + vsum.m128i_u64[1];

if there's a union defined for it (there is under Windows/Visual Studio, but you are using a different environment).
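
If the union member is not available in your environment, one portable option is to store the vector to a small array and add the two lanes from there. Here is a minimal sketch that puts the points above together. The function name sum_u64_sse2, its signature, the _mm_add_epi64 accumulation step, and the scalar tail for an odd element count are my assumptions, not taken from your code, so adapt them as needed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_loadu_si128, _mm_add_epi64, ...

// Hypothetical helper (name and signature are assumptions for illustration).
// Sums n unsigned 64-bit integers; overflow wraps modulo 2^64.
std::uint64_t sum_u64_sse2(const std::uint64_t* a, std::size_t n) {
    assert(a != nullptr || n == 0);

    __m128i vsum = _mm_setzero_si128();  // two 64-bit lanes, both zero

    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {  // 2 uint64 in a single __m128i
        // Unaligned packed load of a[i] and a[i + 1].
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&a[i]));
        vsum = _mm_add_epi64(vsum, v);  // lane-wise 64-bit add
    }

    // Portable lane extraction: store the vector and read the lanes back,
    // instead of relying on the Visual Studio m128i_u64 union member.
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), vsum);
    std::uint64_t sum = lanes[0] + lanes[1];

    // Scalar tail: picks up the last element when n is odd.
    for (; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}
```

The store-and-reload costs one extra instruction or so, but it only happens once after the loop, so it should not matter for any reasonably sized array.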