Question

I'm trying to efficiently execute the following task:

INPUT VALUE: 01101011
MASK:        00110010
MASK RESULT: --10--1-
AGGREGATED:  00000101

I hope this examples explains clearly what I'm trying to achieve. What's the best way to do this in a non-naive way?

Was it helpful?

Solution

This operation is called compress_right or just compress, and it is moderately terrible to implement without hardware support. The non-naive code from Hacker's Delight "7–4 Compress, or Generalized Extract" to implement this function is

unsigned compress(unsigned x, unsigned m) {
    unsigned mk, mp, mv, t;
    int i;
    x = x & m;    // Clear irrelevant bits.
    mk = ~m << 1; // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);    // Parallel suffix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;     // Bits to move.
        m = m ^ mv | (mv >> (1 << i)); // Compress m.
        t = x & mv;
        x = x ^ t | (t >> (1 << i));   // Compress x.
        mk = mk & ~mp;
    }
    return x;
}

BMI2 (implemented in Haswell and later) will have the instruction pext for this operation.


If the mask is a constant (or not a constant but reused multiple times), a relatively obvious optimization is pre-calculating the 5 values that mv takes during the loop. The calculation of mv does not depend on x, so that can be calculated independantly, like this (the same algorithm as above really)

mk = ~m << 1;
for (i = 0; i < 5; i++) {
    mp = mk ^ (mk << 1);
    mp = mp ^ (mp << 2);
    mp = mp ^ (mp << 4);
    mp = mp ^ (mp << 8);
    mp = mp ^ (mp << 16);
    mv = mp & m;
    mask[i] = mv;
    m = m ^ mv | (mv >> (1 << i));
    mk = mk & ~mp;
}

Still looks complicated, but everything here is a constant, so it can be pre-computed (if the compiler can't do it, then you can, simply by running it and then pasting the result into the code). The "real part" of the code, the code that actually has to run at runtime is this:

x = x & m;
t = x & mask[0];
x = x ^ t | (t >> 1);
t = x & mask[1];
x = x ^ t | (t >> 2);
t = x & mask[2];
x = x ^ t | (t >> 4);
t = x & mask[3];
x = x ^ t | (t >> 8);
t = x & mask[4];
x = x ^ t | (t >> 16);

(this is also in Hacker's Delight, formatted a little differently)

Many cases can be simpler again, for example:

  • if m = 0, the result is 0.
  • if m = -1, the result is x.
  • if m = 1, the result is x & 1.
  • if m = ((1 << n) - 1) << k, the result is (x >> k) & m.
  • if m = 0x80000000, the result is x >> 31.
  • if m is an other power of two, the result is (x >> numberOfTrailingZeros(m)) & 1
  • if m is alternating, the "perfect unshuffle algorithm" can be used.
  • if m consists of a few "groups", the "bit group moving" algorithm can be used (ie mask a group, shift it into place (or shift first, mask second), OR all shifted groups together, though more sophisticated approaches exist), this is probably the most important case in practice.
  • ...

For example, the mask from your question would fall in the "bit group moving" case, with code like this:

return ((x >> 1) & 1) | ((x >> 3) & 6);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top