Mask and aggregate bits

Question

This operation is called compress_right or just compress, and it is moderately terrible to implement without hardware support. The non-naive code from Hacker's Delight "7–4 Compress, or Generalized Extract" to implement this function is

unsigned compress(unsigned x, unsigned m) {
    unsigned mk, mp, mv, t;
    int i;
    x = x & m;    // Clear irrelevant bits.
    mk = ~m << 1; // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);    // Parallel suffix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;     // Bits to move.
        m = m ^ mv | (mv >> (1 << i)); // Compress m.
        t = x & mv;
        x = x ^ t | (t >> (1 << i));   // Compress x.
        mk = mk & ~mp;
    }
    return x;
}

BMI2 (implemented in Haswell and later) will have the instruction pext for this operation.

If the mask is a constant (or not a constant but reused multiple times), a relatively obvious optimization is pre-calculating the 5 values that mv takes during the loop. The calculation of mv does not depend on x, so that can be calculated independantly, like this (the same algorithm as above really)

mk = ~m << 1;
for (i = 0; i < 5; i++) {
    mp = mk ^ (mk << 1);
    mp = mp ^ (mp << 2);
    mp = mp ^ (mp << 4);
    mp = mp ^ (mp << 8);
    mp = mp ^ (mp << 16);
    mv = mp & m;
    mask[i] = mv;
    m = m ^ mv | (mv >> (1 << i));
    mk = mk & ~mp;
}

Still looks complicated, but everything here is a constant, so it can be pre-computed (if the compiler can't do it, then you can, simply by running it and then pasting the result into the code). The "real part" of the code, the code that actually has to run at runtime is this:

x = x & m;
t = x & mask[0];
x = x ^ t | (t >> 1);
t = x & mask[1];
x = x ^ t | (t >> 2);
t = x & mask[2];
x = x ^ t | (t >> 4);
t = x & mask[3];
x = x ^ t | (t >> 8);
t = x & mask[4];
x = x ^ t | (t >> 16);

(this is also in Hacker's Delight, formatted a little differently)

Many cases can be simpler again, for example:

if m = 0, the result is 0.
if m = -1, the result is x.
if m = 1, the result is x & 1.
if m = ((1 << n) - 1) << k, the result is (x >> k) & m.
if m = 0x80000000, the result is x >> 31.
if m is an other power of two, the result is (x >> numberOfTrailingZeros(m)) & 1
if m is alternating, the "perfect unshuffle algorithm" can be used.
if m consists of a few "groups", the "bit group moving" algorithm can be used (ie mask a group, shift it into place (or shift first, mask second), OR all shifted groups together, though more sophisticated approaches exist), this is probably the most important case in practice.
...

For example, the mask from your question would fall in the "bit group moving" case, with code like this:

return ((x >> 1) & 1) | ((x >> 3) & 6);