You can vectorize the compares using the _mm_cmplt_epi8 and _mm_cmpgt_epi8 SSE2 intrinsics (available in MSVC).
You can then use _mm_movemask_epi8 on the result of ANDing the two compare results. If the movemask result is 0xFFFF, all 16 comparisons passed; otherwise you need to run the tail loop to find the exact position that failed. You could work that position out from the mask itself, but depending on the value of 'len' it may not be worth the effort.
The original unvectorized loop is also needed as a tail when 'len' is not a multiple of 16. The vectorized version may or may not be faster overall -- you'd need to profile it to be sure.
Scrap that -- those compares operate on signed values, so a straightforward range test on unsigned bytes doesn't work.
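(For the record, the usual workaround -- not what I ended up using below -- is to XOR each byte with 0x80, which maps unsigned order onto signed order so the signed compares become a valid unsigned range test. A minimal sketch, with a hypothetical function name:)

#include <emmintrin.h>

/* Sketch only: returns nonzero if all 16 bytes at p are in [32, 127].
   XORing with 0x80 maps unsigned 0..255 onto signed -128..127 while
   preserving order, so the signed compares test the unsigned range. */
static int block_in_range(const unsigned char *p)
{
    const __m128i bias = _mm_set1_epi8((char)0x80);
    __m128i v = _mm_xor_si128(_mm_loadu_si128((__m128i const *)p), bias);
    __m128i lo = _mm_set1_epi8((char)(32 ^ 0x80));     /* biased lower bound */
    __m128i hi = _mm_set1_epi8((char)(127 ^ 0x80));    /* biased upper bound */
    __m128i bad = _mm_or_si128(_mm_cmplt_epi8(v, lo),  /* 0xFF where below 32 */
                               _mm_cmpgt_epi8(v, hi)); /* 0xFF where above 127 */
    return _mm_movemask_epi8(bad) == 0;                /* no byte failed */
}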
The working version I ended up with, using a bit test instead of compares, is below.
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Returns the number of leading bytes of p[0..len) that lie in [32, 127]. */
int f(int len, unsigned char *p) {
    int i = 0;
    int const len16 = len / 16;
    while (i < len16) {
        /* Unaligned load; p is not guaranteed to be 16-byte aligned. */
        __m128i A = _mm_loadu_si128((__m128i const *)(p + i * 16));
        /* A byte is in [32, 127] iff bit 7 is clear and bit 5 or bit 6 is
           set. The shifts below are 32-bit, but only bit 7 of each byte is
           inspected, so bits that leak across byte boundaries land in
           positions that are never read. */
        __m128i B = _mm_slli_epi32(A, 1);  /* bit 6 -> bit 7 */
        __m128i C = _mm_slli_epi32(A, 2);  /* bit 5 -> bit 7 */
        B = _mm_or_si128(B, C);            /* bit 7 = bit 6 | bit 5 */
        A = _mm_andnot_si128(A, B);        /* bit 7 = ~bit 7 & (bit 6 | bit 5) */
        /* Collect bit 7 of each byte: a 1 means that byte passed. */
        int mask = _mm_movemask_epi8(A);
        if (mask == 0xFFFF) {
            ++i;  /* all 16 bytes passed; on to the next block */
        }
        else {
            if (mask == 0) {
                return i * 16;  /* the first byte of this block failed */
            }
            break;  /* mixed result; let the tail loop locate the failure */
        }
    }
    /* Scalar tail: finds the exact failing position and handles the
       remainder when len is not a multiple of 16. */
    i *= 16;
    while (i < len && p[i] >= 32 && p[i] <= 127) {
        i++;
    }
    return i;
}
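A quick harness to sanity-check it (hypothetical, not part of my test setup):

#include <stdio.h>

int main(void) {
    /* 20 printable bytes, then '\n' (10), which is outside [32, 127]. */
    unsigned char buf[] = "abcdefghijklmnopqrst\nxyz";
    printf("%d\n", f((int)(sizeof(buf) - 1), buf));  /* prints 20 */
    return 0;
}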
Since I don't have a 64-bit OS on this PC, I can't do a proper perf test. However, a profiling run gave:
- naive loop: 30.44
- 64-bit integer version: 15.22 (on a 32-bit OS)
- SSE version: 5.21
So the SSE version is a lot faster than the naive loop. I'd expect the 64-bit integer version to do significantly better on a 64-bit system -- there may end up being little difference between the SSE and 64-bit versions.