Here are a few things I see.

In both the `int` and `short` cases, when you store the `__m128i` to `sumDot`, you use `_mm_storeu_si128` on targets that are much smaller than 128 bits. This means you've been corrupting memory, and were lucky not to be bitten.

- Related to this, because `sumDot` is an `int[1]` even in the `short` case, you were storing two `short`s in one `int`, and then reading it as an `int`.
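To make the store problem concrete, here is a minimal sketch of two safe ways to get lane 0 out of a vector. The helper names `lane0_via_store` and `lane0_via_cvt` are made up for illustration and are not from the original code; both need only SSE2.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: store all 128 bits into a buffer that really is
   16 bytes long, then read the bottom 32-bit lane from it. */
int32_t lane0_via_store(__m128i v)
{
    int32_t tmp[4];                        /* 4 * 32 bits = 128 bits */
    _mm_storeu_si128((__m128i *)tmp, v);   /* writes exactly 16 bytes */
    return tmp[0];
}

/* Hypothetical helper: move only the low 32 bits out of the vector,
   so no memory beyond the destination is ever written. */
int32_t lane0_via_cvt(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}
```

Either way, nothing is written past the end of the destination, which is the problem with storing a full vector into an `int[1]`.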
In the `short` case you're missing one horizontal vector reduction step. Now that you have 8 `short`s per vector, you need log_2(8) = 3 vector reduction steps:

```c
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
```
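As a self-contained sketch of the three-step reduction: the function name `hsum8_epi16` is hypothetical, and the `target` attribute is a GCC/Clang convenience assumed here so that the snippet builds without `-mssse3`. The sum is assumed to fit in 16 bits.

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_hadd_epi16 */

/* Hypothetical helper: horizontally add the 8 int16_t values at p. */
#if defined(__GNUC__)
__attribute__((target("ssse3")))
#endif
int16_t hsum8_epi16(const int16_t *p)
{
    __m128i vsum = _mm_loadu_si128((const __m128i *)p);
    vsum = _mm_hadd_epi16(vsum, vsum);  /* 8 values -> 4 pair sums        */
    vsum = _mm_hadd_epi16(vsum, vsum);  /* 4 pair sums -> 2 sums of four  */
    vsum = _mm_hadd_epi16(vsum, vsum);  /* 2 sums of four -> the total    */
    /* The cast restores the sign of the extracted 16-bit lane. */
    return (int16_t)_mm_extract_epi16(vsum, 0);
}
```

Stopping after two steps would leave only the sum of the first four elements in lane 0.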
(Optional) Since you're on SSE4.1 already, you might as well use one of the goodies it has: the `PEXTR*` instructions. They take the index of the lane from which to extract. You're interested in the bottom lane (lane 0) because that's where the sum ends up after your vector reduction.

```c
/* 32-bit */
sumDot[0] = _mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = _mm_extract_epi16(vsum, 0);
```

EDIT: `_mm_extract_epi16` does not sign-extend the 16-bit word it extracts; the result is zero-extended to `int`. You must ask for the sign extension yourself with a cast:

```c
/* 32-bit */
sumDot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = (int16_t)_mm_extract_epi16(vsum, 0);
```
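A small sketch of the difference the cast makes, assuming a vector whose lane 0 holds a negative value. The helper names are illustrative only and are not part of the original code.

```c
#include <emmintrin.h>  /* SSE2: _mm_extract_epi16 */
#include <stdint.h>

/* Hypothetical helper: returns lane 0 exactly as the intrinsic gives it,
   i.e. the 16-bit word zero-extended to int. */
int lane0_zero_extended(__m128i v)
{
    return _mm_extract_epi16(v, 0);
}

/* Hypothetical helper: the cast to int16_t reinterprets the low 16 bits
   as signed, so converting back to int sign-extends. */
int lane0_sign_extended(__m128i v)
{
    return (int16_t)_mm_extract_epi16(v, 0);
}
```

With every lane set to -5, the first helper returns 65531 (0xFFFB) and the second returns -5.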
EDIT2: I found an even BETTER solution! It uses exactly the instruction we need (`PMADDWD`), and is identical to the 32-bit code except that the iteration bounds are different and, instead of `_mm_mullo_epi16`, you use `_mm_madd_epi16` in the loop. This needs only two 32-bit vector reduction stages. http://pastebin.com/A9ibkMwP

- (Optional) It is good style, but will make no difference, to use the `_mm_setzero_*()` functions instead of `_mm_set1_*(0)`.
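The linked paste is not reproduced here, so the following is only a sketch of what a `_mm_madd_epi16`-based loop might look like, not the code behind the link. It assumes a hypothetical function `dot_i16`, a length that is a multiple of 8, and a total that fits in 32 bits. The two reduction stages are done with SSE2 shuffles and adds, which is a choice made for this sketch.

```c
#include <emmintrin.h>  /* SSE2: _mm_madd_epi16, _mm_shuffle_epi32 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dot product of two int16_t arrays.
   Assumption: n is a multiple of 8 and the result fits in int32_t. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    __m128i vsum = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PMADDWD: multiply the 8 pairs of 16-bit lanes and add adjacent
           products, giving 4 partial sums in 32-bit lanes. */
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(va, vb));
    }
    /* Stage 1: add the upper 64 bits onto the lower 64 bits. */
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Stage 2: add lane 1 onto lane 0. */
    vsum = _mm_add_epi32(vsum, _mm_shuffle_epi32(vsum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(vsum);
}
```

Because the products are accumulated in 32-bit lanes, the sum is not limited to the 16-bit range, and the final scalar comes out already sign-correct via `_mm_cvtsi128_si32`.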