int vs short vectorization

Question

Here's a few things I see.

In both the int and short case, when you're storing the __m128 to sumDot, you use _mm_storeu_si128 on targets that are much smaller than 128 bits. This means you've been corrupting memory, and were lucky you were not bitten.
- Related to this, because sumDot is an int[1] even in the short case, you were storing two shorts in one int, and then reading it as an int.
In the short case you're missing one horizontal vector reduction step. Remember that now that you've got 8 shorts per vector, you must now have log_2(8) = 3 vector reduction steps.
```
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
vsum = _mm_hadd_epi16(vsum, vsum);
```
(Optional) Since you're onto SSE4.1 already, might as well use one of the goodies it has: The PEXTR* instructions. They take the index of the lane from which to extract. You're interested in the bottom lane (lane 0) because that's where the sum ends up after your vector reduction.

~~/* 32-bit */ sumDot[0] = _mm_extract_epi32(vsum, 0); /* 16-bit */ sumDot[0] = _mm_extract_epi16(vsum, 0);~~

EDIT: Apparently the compiler doesn't sign-extend the 16-bit word extracted with _mm_extract_epi16. You must convince it to do so yourself.
```
/* 32-bit */
sumDot[0] = (int32_t)_mm_extract_epi32(vsum, 0);
/* 16-bit */
sumDot[0] = (int16_t)_mm_extract_epi16(vsum, 0);
```
EDIT2: I found an even BETTER solution! It uses exactly the instruction we need (PMADDWD), and is identical to the 32-bit code except that the iteration bounds are different, and instead of _mm_mullo_epi16 you use _mm_madd_epi16 in the loop. This only needs two 32-bit vector reduction stages. http://pastebin.com/A9ibkMwP
(Optional) It is good style but will make no difference to use the _mm_setzero_*() functions instead of _mm_set1_*(0).