Question

I need a function of an __m128i variable that has period 2^128. It doesn't need to monotonically increase (like a counter), but visit each value once.

The simplest example I could think of is in fact a 128-bit counter, but I found that difficult to implement in SSE. Are there any simpler/faster solutions?

Was it helpful?

Solution

Here's a monotonic counter. I'm not sure if you can call it simple though.

Assuming both ONE and ZERO are always in registers, then this should compile to 5 instructions. (7 or 8 if VEX-encoding is not used)

inline __m128i nextc(__m128i x){
    const __m128i ONE = _mm_setr_epi32(1,0,0,0);
    const __m128i ZERO = _mm_setzero_si128();

    x = _mm_add_epi64(x,ONE);
    __m128i t = _mm_cmpeq_epi64(x,ZERO);
    t = _mm_and_si128(t,ONE);
    t = _mm_unpacklo_epi64(ZERO,t);
    x = _mm_add_epi64(x,t);

    return x;
}

Test Code (MSVC):

int main() {

    __m128i x = _mm_setr_epi32(0xfffffffa,0xffffffff,1,0);

    int c = 0;
    while (c++ < 10){
        cout << x.m128i_u64[0] << "  " << x.m128i_u64[1] << endl;
        x = nextc(x);
    }

    return 0;
}

Output:

18446744073709551610  1
18446744073709551611  1
18446744073709551612  1
18446744073709551613  1
18446744073709551614  1
18446744073709551615  1
0  2
1  2
2  2
3  2

Slightly better version suggested by @Norbert P. It saves 1 instruction over my original solution.

inline __m128i nextc(__m128i x){
    const __m128i ONE = _mm_setr_epi32(1,0,0,0);
    const __m128i ZERO = _mm_setzero_si128();

    x = _mm_add_epi64(x,ONE);
    __m128i t = _mm_cmpeq_epi64(x,ZERO);
    t = _mm_unpacklo_epi64(ZERO,t);
    x = _mm_sub_epi64(x,t);

    return x;
}

OTHER TIPS

Never forget the KISS principle.

Pasting this (unsigned integers are required to wrap around by the C standard, hence visiting each value only once):

__uint128_t inc(__uint128_t x) {
  return x+1;
}

into this yields (for x64):

    addq    $1, %rdi
    adcq    $0, %rsi
    movq    %rdi, %rax
    movq    %rsi, %rdx
    ret

easy/fast enough? If you inline that, you'll probably be able to get away with just the addq/adcq (the movqs and ret are required by the x64 ABI: if you inline the function, they are not required)


To address Voo's comment about the suckiness of MSVC, you can use the following:

inline void inc(unsigned long long *x, unsigned long long *y) {
  if (!++*x) ++*y; // yay for obfuscation!
}

I don't have a MSVC install nearby so I can't test it, but it should yield something similar to what I posted above. Then, if you really need a __m128i, you should be able to cast the two halves.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top