Question

I'm trying to optimize the following function (simplified a bit, but it's the loop where my program spends a lot of its time):

int f(int len, unsigned char *p) {
  int i = 0;
  while (i < len && p[i] >= 32 && p[i] <= 127) {
      i++;
  }
  return i;
}

I think it could be optimized with vector instructions, but from a bit of research, it looks like SSE is not meant for working at the byte level. This program targets only 64-bit Intel CPUs on OS X. Is there a clever bit-twiddling trick I'm not seeing that would let me work on 64 bits at a time? LLVM with -O3 doesn't do any clever optimizations here.

UPDATE:

The SIMD code is generally fastest on my benchmark (depending on the size of the input), but for some reason the app overall is slower with SIMD than with the naïve code or the bit-twiddling trick. For context, the application is finding the length of subsequences of ASCII strings in a stream of input to a terminal emulator. ASCII strings get special "fast path" treatment. I could only mark one answer as correct but both were great. I did make one small improvement to the bit twiddling, removing an if statement by doing this:

        while (i < len - 8) {
            // Load 8 bytes at a time (unaligned read; fine on x86).
            uint64_t bytes = *(uint64_t *)(p + i);
            // Bits 5 and 6 of each byte: at least one must be set for 32 <= b <= 127.
            uint64_t middleBits = bytes & 0x6060606060606060ULL;
            // Bit 7 of each byte: must be clear.
            uint64_t highBits = bytes & 0x8080808080808080ULL;
            // Fold bit 6 down onto bit 5...
            middleBits |= (middleBits >> 1);
            // ...and clear bit 5 wherever bit 7 was set.
            middleBits &= ~(highBits >> 2);
            // All 8 bytes are in range iff bit 5 ended up set in every byte.
            if ((middleBits & 0x2020202020202020ULL) != 0x2020202020202020ULL) {
                break;
            }
            i += 8;
        }

Solution 2

You can vectorize the compares using the _mm_cmplt_epi8 and _mm_cmpgt_epi8 intrinsics (SSE2; available in MSVC and elsewhere).

You can then use a movemask on the result of ANDing the two compare results. If the movemask result is 0xFFFF, all comparisons passed; otherwise you need to run the tail loop to find the exact position that failed the test. You could work that position out from the mask itself, but depending on the value of 'len' it may not be worth the effort.
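If you do want to recover the failing position directly from the mask, one bit scan is enough. A one-line sketch (my addition, not from the original answer), using the mask and block index i from the working code below, where a set bit means the byte passed:

int firstFail = __builtin_ctz(~mask & 0xFFFF); // only valid when mask != 0xFFFF
return i * 16 + firstFail;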

The original unvectorized loop for the tail is also required if 'len' is not a multiple of 16. It may or may not be faster -- you'd need to profile it to be sure.

scrap that - the compares operate on signed values and it doesn't work.

Working version below.

#include <emmintrin.h> // SSE2 intrinsics

union UmmU8 {
    __m128i mm_;
    unsigned char u8_[16];
};

int f(int len, unsigned char *p) {
    int i = 0;
    __m128i A;
    __m128i B;
    __m128i C;
    UmmU8* pu = (UmmU8*)p; // assumes p is 16-byte aligned
    int const len16 = len / 16;
    while (i < len16) {
        A = pu[i].mm_;
        // The 32-bit shifts leak bits across byte boundaries, but only
        // the top bit of each byte matters to movemask below:
        // B's top bits hold bit 6 of each byte, C's hold bit 5.
        B = _mm_slli_epi32(A, 1);
        C = _mm_slli_epi32(A, 2);
        B = _mm_or_si128(B, C);
        // Top bit of each byte is now ~bit7 & (bit6 | bit5), i.e. "in range".
        A = _mm_andnot_si128(A, B);

        int mask = _mm_movemask_epi8(A);
        if (mask == 0xFFFF) {
            ++i; // all 16 bytes in range
        }
        else {
            if (mask == 0) {
                return i * 16; // the very first byte of the block fails
            }
            break; // let the scalar tail pin down the exact position
        }
    }
    i *= 16;
    while (i < len && p[i] >= 32 && p[i] <= 127) {
        i++;
    }
    return i;
}

Since I don't have a 64-bit OS on this PC, I can't do a proper perf test. However, a profiling run gave:

  • naive loop: 30.44
  • 64-bit integer: 15.22 (on a 32-bit OS)
  • SSE implementation: 5.21

So the SSE version is a lot faster than the naive loop. I'd expect the 64-bit integer version to perform significantly better on a 64-bit system; there may then be little difference between the SSE and 64-bit versions.

Other tips

I am not sure whether this answers your question or will considerably speed up your code, but here is an idea that came to mind. Since 32 = 2^5, a byte in the range 32 to 127 must have its 6th or 7th bit set and its 8th bit clear (counting bits from 1). You can extend the test to 64-bit integers, giving code like:

// Check whether every byte of x is in the range 32..127.
unsigned bytesInRange(unsigned long long x) {
    unsigned long long y, z;
    // Any byte with its high bit set is >= 128: reject.
    if ((x & 0x8080808080808080ULL) != 0) return 0;
    // Fold bit 6 of each byte onto bit 5; afterwards every byte must
    // have bit 5 set, i.e. the original byte had bit 5 or bit 6 set.
    y = x >> 1;
    z = x | y;
    if ((z & 0x2020202020202020ULL) == 0x2020202020202020ULL) return 1;
    return 0;
}

int f(int len, unsigned char *p) {
  int i = 0;
  int len8 = len / 8;
  unsigned long long *q = (unsigned long long *) p;
  // Test 8 bytes per iteration using the helper above.
  while (i < len8 && bytesInRange(q[i])) {
    i++;
  }

  // Scalar tail: finish the remaining bytes and find the exact position.
  i = i * 8;
  while (i < len && p[i] >= 32 && p[i] <= 127) {
    i++;
  }
  return i;
}
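A quick sanity check of bytesInRange on the boundary bytes (31, 32, 127, 128), each replicated into all eight byte lanes. This harness is my addition, not part of the original answer:

#include <stdio.h>

// Assumes bytesInRange() from above is in scope.
int main(void) {
    printf("%u\n", bytesInRange(0x1F1F1F1F1F1F1F1FULL)); // 31  -> expect 0
    printf("%u\n", bytesInRange(0x2020202020202020ULL)); // 32  -> expect 1
    printf("%u\n", bytesInRange(0x7F7F7F7F7F7F7F7FULL)); // 127 -> expect 1
    printf("%u\n", bytesInRange(0x8080808080808080ULL)); // 128 -> expect 0
    return 0;
}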

On architectures where alignment is required, it needs to be checked before the first loop; see the sketch below.
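One way that check could look, as a sketch (my addition) that reuses the bytesInRange() helper above, with a byte-by-byte prologue until the pointer is 8-byte aligned:

#include <stdint.h>

int f_aligned(int len, unsigned char *p) {
    int i = 0;
    // Scalar prologue: advance until p + i is 8-byte aligned,
    // stopping early if a byte fails the range test.
    while (i < len && ((uintptr_t)(p + i) & 7) != 0) {
        if (p[i] < 32 || p[i] > 127)
            return i;
        i++;
    }
    // Aligned 8-byte middle loop.
    while (i + 8 <= len && bytesInRange(*(const unsigned long long *)(p + i)))
        i += 8;
    // Scalar tail.
    while (i < len && p[i] >= 32 && p[i] <= 127)
        i++;
    return i;
}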

I have tried several approaches to the problem, based on SSE2 and SSE4.2. The string operations from SSE4.2 are rather slow; the SSE2 versions outperform them easily (a sketch of the SSE4.2 style follows for reference). Note that in general the best solution depends heavily on the average magnitude of the expected answer.
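A minimal sketch of the SSE4.2 string-instruction approach (my reconstruction of that style, not the benchmarked code; it assumes _mm_cmpestri with the ranges mode):

#include <nmmintrin.h> // SSE4.2

int CommonAsciiLength_sse42_sketch(int len, unsigned char *p) {
    // One range pair for PCMPESTRI: bytes 32..127 count as "in range".
    const __m128i range = _mm_setr_epi8(32, 127, 0, 0, 0, 0, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 0);
    int i = 0;
    while (i + 16 <= len) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(p + i));
        // Negative polarity: report the first byte NOT in 32..127;
        // returns 16 when every byte is in range.
        int idx = _mm_cmpestri(range, 2, chunk, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES |
                               _SIDD_NEGATIVE_POLARITY);
        if (idx < 16)
            return i + idx;
        i += 16;
    }
    while (i < len && p[i] >= 32 && p[i] <= 127)
        i++;
    return i;
}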

Here is one of the best performing solutions for answers up to roughly 400:

#include <emmintrin.h> // SSE2

//SSE2 vectorization by stgatilov: no unrolling, fast BSF tail
int CommonAsciiLength_sse2_end(int len, unsigned char *p) {
  const __m128i *ptr = (const __m128i *)p; // assumes p is 16-byte aligned
  int blocks = len >> 4;

  int cnt;
  for (cnt = 0; cnt < blocks; cnt++) {
    // One signed compare catches both failure cases: bytes below 32, and
    // bytes >= 128 (negative when treated as signed) are all "< 32".
    __m128i mask = _mm_cmplt_epi8(ptr[cnt], _mm_set1_epi8(32));
    int val = _mm_movemask_epi8(mask);
    if (val)
      return 16 * cnt + __builtin_ctz(val); // __builtin_ctz: GCC/Clang
  }
  // Tail: load one more full 16-byte block (this reads past the end of
  // the data, so the buffer must be readable there) and force the mask
  // bits beyond len so the bit scan stops at the logical end.
  __m128i mask = _mm_cmplt_epi8(ptr[cnt], _mm_set1_epi8(32));
  int val = _mm_movemask_epi8(mask);
  val |= -(1 << (len - 16 * cnt));
  return 16 * cnt + __builtin_ctz(val);
}

Note that for larger answers this solution benefits further from unrolling; a sketch of such unrolling follows.
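The sse2_x4 variants in the timings below presumably OR several compare masks together so that only one movemask is issued per 64 bytes. A hedged sketch of that kind of unrolling (my reconstruction, not the benchmarked code):

#include <emmintrin.h> // SSE2

int CommonAsciiLength_sse2_x4_sketch(int len, unsigned char *p) {
    const __m128i *ptr = (const __m128i *)p; // assumes 16-byte alignment
    const __m128i lim = _mm_set1_epi8(32);
    int blocks = len >> 4;
    int cnt = 0;
    // Check 64 bytes per iteration; on a hit, drop to the scalar
    // loop below to locate the exact byte.
    for (; cnt + 4 <= blocks; cnt += 4) {
        __m128i m0 = _mm_cmplt_epi8(ptr[cnt + 0], lim);
        __m128i m1 = _mm_cmplt_epi8(ptr[cnt + 1], lim);
        __m128i m2 = _mm_cmplt_epi8(ptr[cnt + 2], lim);
        __m128i m3 = _mm_cmplt_epi8(ptr[cnt + 3], lim);
        __m128i any = _mm_or_si128(_mm_or_si128(m0, m1),
                                   _mm_or_si128(m2, m3));
        if (_mm_movemask_epi8(any))
            break;
    }
    int i = 16 * cnt;
    while (i < len && p[i] >= 32 && p[i] <= 127)
        i++;
    return i;
}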

Here are some timings for different solutions and different answer lengths, measured on Ivy Bridge. Note that it only makes sense to compare timings within a single run; comparing across runs with different average answers is not meaningful.

All checked.
Average answer = 7.0
Time = 4.879   (1884680192) original
Time = 6.021   (1884680192) bitmask
Time = 5.205   (1884680192) Pete
Time = 5.094   (1884680192) sse2
Time = 5.301   (1884680192) sse2_x4
Time = 1.603   (1884680192) sse42
Time = 1.235   (1884680192) sse2_end
Time = 2.319   (1884680192) sse2_x4_end
=========================================
All checked.
Average answer = 47.0
Time = 5.825   (-1867343006) original
Time = 4.792   (-1867343006) bitmask
Time = 4.490   (-1867343006) Pete
Time = 4.327   (-1867343006) sse2
Time = 5.260   (-1867343006) sse2_x4
Time = 3.347   (-1867343006) sse42
Time = 2.505   (-1867343006) sse2_end
Time = 3.008   (-1867343006) sse2_x4_end
=========================================
All checked.
Average answer = 151.4
Time = 4.372   (-2086294174) original
Time = 2.150   (-2086294174) bitmask
Time = 1.662   (-2086294174) Pete
Time = 1.492   (-2086294174) sse2
Time = 2.249   (-2086294174) sse2_x4
Time = 1.649   (-2086294174) sse42
Time = 0.986   (-2086294174) sse2_end
Time = 1.398   (-2086294174) sse2_x4_end
=========================================
All checked.
Average answer = 426.8
Time = 3.772   (1814680269) original
Time = 1.320   (1814680269) bitmask
Time = 0.830   (1814680269) Pete
Time = 0.692   (1814680269) sse2
Time = 0.870   (1814680269) sse2_x4
Time = 1.186   (1814680269) sse42
Time = 0.531   (1814680269) sse2_end
Time = 0.573   (1814680269) sse2_x4_end
=========================================
All checked.
Average answer = 1083.4
Time = 2.788   (358018991) original
Time = 0.819   (358018991) bitmask
Time = 0.443   (358018991) Pete
Time = 0.344   (358018991) sse2
Time = 0.347   (358018991) sse2_x4
Time = 0.813   (358018991) sse42
Time = 0.297   (358018991) sse2_end
Time = 0.256   (358018991) sse2_x4_end

Full code of all solutions, along with testing, is available here.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow