Question

The intrinsic:

int mask = _mm256_movemask_epi8(__m256i s1)

creates a mask whose 32 bits correspond to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2, for example), I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask.

What is the best way to do this?

Edit: I need to perform the inverse because the intrinsic _mm256_blendv_epi8 accepts a __m256i mask rather than a uint32_t. As such, in the resulting __m256i mask, I can ignore all the bits other than the MSB of each byte.
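
A minimal sketch of the intended round trip, where inverse_movemask is the hypothetical routine this question asks for:

#include <immintrin.h>
#include <stdint.h>

__m256i inverse_movemask(uint32_t mask); /* the routine being asked for */

/* Select bytes of b where the corresponding bit of the (possibly
   manipulated) scalar mask is set; _mm256_blendv_epi8 reads only the
   MSB of each mask byte. */
__m256i blend_by_scalar_mask(__m256i a, __m256i b, uint32_t mask) {
  return _mm256_blendv_epi8(a, b, inverse_movemask(mask));
}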

Solution 2

Here is an alternative to the LUT or PDEP approaches that might be more efficient:

  1. Copy your 32-bit mask to the low 4 bytes (bytes 0..3) of some ymm register and to bytes 16..19 of the same register. You could use a temporary array and _mm256_load_si256. Or you could move a single copy of the 32-bit mask to the low bytes of an xmm register and then broadcast it with VPBROADCASTD (_mm256_broadcastd_epi32) or other broadcast/shuffle instructions.
  2. Rearrange the bytes of the register so that each of the low 8 bytes contains the low 8 bits of your mask, the next 8 bytes the next 8 bits, etc. This can be done with VPSHUFB (_mm256_shuffle_epi8) with a control register containing '0' in the low 8 bytes, '1' in the next 8 bytes, etc.
  3. Select the proper bit for each byte with VPOR (_mm256_or_si256) or VPAND (_mm256_and_si256): OR each byte with the complement of its selector bit, or AND it with the selector bit itself.
  4. Set the MSB of the appropriate bytes with VPCMPEQB (_mm256_cmpeq_epi8): compare each byte to 0xFF. If you want each bit of the mask toggled, use VPAND in the previous step and compare to zero.

Additional flexibility of this approach is that you could choose a different control register for step #2 and a different mask for step #3 to shuffle the bits of your bit mask (for example, you could copy the mask to the ymm register in reversed order). The get_mask3 function in the benchmark answer below implements exactly these steps.

Other tips

I have implemented the three approaches described in the other answers on a Haswell machine. Evgeny Kluev's approach is the fastest (1.07 s), followed by Jason R's (1.97 s) and Paul R's (2.44 s). The code below was compiled with the -march=core-avx2 -O3 optimization flags.

#include <immintrin.h>
#include <cstdint>
#include <iostream>
#include <boost/date_time/posix_time/posix_time.hpp>

//t_icc = 1.07 s
//t_g++ = 1.09 s
__m256i get_mask3(const uint32_t mask) {
  __m256i vmask(_mm256_set1_epi32(mask));      // step 1: broadcast mask to every dword
  const __m256i shuffle(_mm256_setr_epi64x(0x0000000000000000,
      0x0101010101010101, 0x0202020202020202, 0x0303030303030303));
  vmask = _mm256_shuffle_epi8(vmask, shuffle); // step 2: byte i gets mask byte i/8
  const __m256i bit_mask(_mm256_set1_epi64x(0x7fbfdfeff7fbfdfe));
  vmask = _mm256_or_si256(vmask, bit_mask);    // step 3: set every bit except byte i's selector bit i%8
  return _mm256_cmpeq_epi8(vmask, _mm256_set1_epi64x(-1)); // step 4: 0xFF iff the selector bit was set
}

//t_icc = 1.97 s
//t_g++ = 1.97 s
__m256i get_mask2(const uint32_t mask) {
  __m256i vmask(_mm256_set1_epi32(mask));  // broadcast mask to all 8 dwords
  const __m256i shift(_mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0));
  vmask = _mm256_sllv_epi32(vmask, shift); // shift dword i left by i bits
  const __m256i shuffle(_mm256_setr_epi64x(0x0105090d0004080c,
      0x03070b0f02060a0e, 0x0105090d0004080c, 0x03070b0f02060a0e));
  vmask = _mm256_shuffle_epi8(vmask, shuffle); // collect byte MSB positions within each lane
  const __m256i perm(_mm256_setr_epi64x(0x0000000000000004, 0x0000000100000005,
      0x0000000200000006, 0x0000000300000007));
  return _mm256_permutevar8x32_epi32(vmask, perm); // fix the cross-lane dword order
}

//t_icc = 2.44 s
//t_g++ = 2.45 s
__m256i get_mask1(uint32_t mask) {
  const uint64_t pmask = 0x8080808080808080ULL; // bit unpacking mask for PDEP
  uint64_t amask0, amask1, amask2, amask3;
  amask0 = _pdep_u64(mask, pmask); // deposit mask bits 0..7 into the MSBs of 8 bytes
  mask >>= 8;
  amask1 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask2 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask3 = _pdep_u64(mask, pmask);
  return _mm256_set_epi64x(amask3, amask2, amask1, amask0);
}

int main() {
  __m256i mask = _mm256_setzero_si256(); // zero-initialize the xor accumulator
  boost::posix_time::ptime start(
      boost::posix_time::microsec_clock::universal_time()); 
  for(unsigned i(0); i != 1000000000; ++i)
    { 
      mask = _mm256_xor_si256(mask, get_mask3(i));
    }
  boost::posix_time::ptime end(
      boost::posix_time::microsec_clock::universal_time());
  std::cout << "duration:" << (end-start) << 
    " mask:" << _mm256_movemask_epi8(mask) << std::endl;
  return 0;
}

My initial approach to this was similar to @Jason R's, because that is how "normal" operations work, but most of these operations only care about the high bit of each element and ignore all the other bits. Once I realized this, the _mm*_maskz_broadcast*_epi*(mask, __m128i) family of functions made the most sense. You will need to enable -mavx512vl and -mavx512bw (gcc).

To get a vector with the highest bit of each byte set according to a mask:

/* convert 16 bit mask to __m128i control byte mask */
_mm_maskz_broadcastb_epi8((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m256i control byte mask */
_mm256_maskz_broadcastb_epi8((__mmask32)mask,_mm_set1_epi32(~0))
/* convert 64 bit mask to __m512i control byte mask */
_mm512_maskz_broadcastb_epi8((__mmask64)mask,_mm_set1_epi32(~0))

To get a vector with the highest bit of each word set according to a mask:

/* convert 8 bit mask to __m128i control word mask */
_mm_maskz_broadcastw_epi16((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m256i control word mask */
_mm256_maskz_broadcastw_epi16((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m512i control word mask */
_mm512_maskz_broadcastw_epi16((__mmask32)mask,_mm_set1_epi32(~0))

To get a vector with the highest bit of each double word set according to a mask:

/* convert 8 bit mask to __m256i control mask */
_mm256_maskz_broadcastd_epi32((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m512i control mask */
_mm512_maskz_broadcastd_epi32((__mmask16)mask,_mm_set1_epi32(~0))

To get a vector with the highest bit of each quad word set according to a mask:

/* convert 8 bit mask to __m512i control mask */
_mm512_maskz_broadcastq_epi64((__mmask8)mask,_mm_set1_epi32(~0))

The one specific to this question is _mm256_maskz_broadcastb_epi8((__mmask32)mask, _mm_set1_epi32(~0)), but I include the others for reference/comparison.

Note that each byte/word/... will be either all ones or all zeroes according to the mask (not just the highest bit). This can also be useful for vectorized bit operations (ANDing with another vector, for instance, to zero out unwanted bytes/words).
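
For the question's stated use case, here is a minimal sketch (assuming AVX512VL and AVX512BW are enabled) of the byte variant feeding _mm256_blendv_epi8:

#include <immintrin.h>
#include <stdint.h>

/* Each byte of vmask becomes all ones where the corresponding bit of
   mask is set and all zeroes elsewhere, so its MSBs drive the blend. */
__m256i blend_by_bits(__m256i a, __m256i b, uint32_t mask) {
  __m256i vmask = _mm256_maskz_broadcastb_epi8((__mmask32)mask,
                                               _mm_set1_epi32(~0));
  return _mm256_blendv_epi8(a, b, vmask);
}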

Another note: each _mm_set1_epi32(~0) could/should be converted to a constant (either manually or by the compiler), so it should compile to just one fairly quick operation, though it may be slightly faster in testing than in real life since the constant will likely stay in a register. These then compile to VPMOVM2{B,W,D,Q} instructions.

Edit: In case your compiler doesn't support AVX512, the inline assembly version should look like:

inline __m256i dmask2epi8(__mmask32 mask){
  __m256i ret;
  /* VPMOVM2B: each destination byte becomes all ones where the
     corresponding mask bit is set, all zeroes otherwise. */
  __asm("vpmovm2b   %1, %0":"=x"(ret):"k"(mask):);
  return ret;
}

The other instructions are similar.
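
For example, the word variant would presumably follow the same pattern (this variant is my illustration, not from the original answer):

inline __m256i dmask2epi16(__mmask16 mask){
  __m256i ret;
  /* VPMOVM2W: each destination word becomes all ones where the
     corresponding mask bit is set, all zeroes otherwise. */
  __asm("vpmovm2w   %1, %0":"=x"(ret):"k"(mask):);
  return ret;
}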

Here's another implementation that might work on AVX2, since you had that tag on your question (it is untested since I don't have a Haswell machine). It is similar to Evgeny Kluev's answer, but it might take fewer instructions. It requires two constant __m256i masks, though. If you're doing this many times in a loop, then the overhead of setting up those constants once ahead of time may be negligible. A sketch of the full sequence follows the list.

  • Take your 32-bit mask and broadcast it to all 8 slots of a ymm register using _mm256_broadcastd_epi32() (or simply _mm256_set1_epi32()).

  • Create a __m256i holding 8 32-bit integers with values [0, 1, 2, 3, 4, 5, 6, 7] (from the least-significant to most-significant element).

  • Use that constant vector as the shift counts to shift each of the 32-bit integers in your ymm register left by a different amount, using _mm256_sllv_epi32().

  • Now, if we view the ymm register as holding 8-bit integers and look at their MSBs, then the register now holds the MSBs for byte indices [7, 15, 23, 31, 6, 14, 22, 30, 5, 13, 21, 29, 4, 12, 20, 28, 3, 11, 19, 27, 2, 10, 18, 26, 1, 9, 17, 25, 0, 8, 16, 24] (from the least-significant to the most-significant element).

  • Use a bitwise-AND against a constant mask of [0x80, 0x80, 0x80, ...] to isolate the MSBs from each byte.

  • Use a sequence of shuffles and/or permutes to get the elements back in the order that you want. Unfortunately, there is no any-to-any permute for 8-bit integers like there is for floating-point values in AVX2.
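
Here is a minimal, untested sketch of that sequence, reusing the shuffle/permute constants from get_mask2 above (the permute is written with _mm256_setr_epi32 for readability); the AND step can be dropped if only the byte MSBs are consumed, as by _mm256_blendv_epi8:

#include <immintrin.h>
#include <stdint.h>

__m256i mask_to_bytes_avx2(uint32_t mask) {
  __m256i v = _mm256_set1_epi32((int)mask);              // broadcast to all 8 dwords
  const __m256i shift = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  v = _mm256_sllv_epi32(v, shift);                       // shift dword i left by i
  v = _mm256_and_si256(v, _mm256_set1_epi8((char)0x80)); // isolate the byte MSBs
  const __m256i shuf = _mm256_setr_epi64x(0x0105090d0004080c,
      0x03070b0f02060a0e, 0x0105090d0004080c, 0x03070b0f02060a0e);
  v = _mm256_shuffle_epi8(v, shuf);                      // regroup bytes within each lane
  const __m256i perm = _mm256_setr_epi32(4, 0, 5, 1, 6, 2, 7, 3);
  return _mm256_permutevar8x32_epi32(v, perm);           // fix the cross-lane dword order
}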

The only reasonably efficient way I can think of is with an 8-bit LUT: do 4 x 8-bit lookups and then load the results into a vector, e.g.

static const uint64_t LUT[256] = { 0x0000000000000000ULL,
                                   ...
                                   0xffffffffffffffffULL };

uint64_t amask[4] __attribute__ ((aligned(32)));

uint32_t mask;
__m256i vmask;

amask[0] = LUT[mask & 0xff];
amask[1] = LUT[(mask >> 8) & 0xff];
amask[2] = LUT[(mask >> 16) & 0xff];
amask[3] = LUT[mask >> 24];
vmask = _mm256_load_si256((__m256i *)amask);
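
The elided LUT entries simply expand each of the 8 index bits into a 0x00/0xFF byte; as an illustration (not part of the original answer), the table could be built at startup like this:

#include <stdint.h>

static uint64_t LUT[256];

static void init_lut(void) {
  for (int i = 0; i < 256; ++i) {
    uint64_t v = 0;
    for (int b = 0; b < 8; ++b)
      if (i & (1 << b))
        v |= 0xffULL << (8 * b); /* bit b -> byte b = 0xFF */
    LUT[i] = v;
  }
}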

Alternatively you could use registers instead of the temporary array and see if your compiler can do something more efficient that doesn't involve going via memory:

static const uint64_t LUT[256] = { 0x0000000000000000ULL,
                                   ...
                                   0xffffffffffffffffULL };

uint64_t amask0, amask1, amask2, amask3;

uint32_t mask;
__m256i vmask;

amask0 = LUT[mask & 0xff];
amask1 = LUT[(mask >> 8) & 0xff];
amask2 = LUT[(mask >> 16) & 0xff];
amask3 = LUT[mask >> 24];
vmask = _mm256_set_epi64x(amask3, amask2, amask1, amask0);

Afterthought: an interesting challenge might be to use e.g. the Haswell BMI2 instructions to perform the equivalent of the 8 -> 64 bit LUT operation and thereby get rid of the LUT. It looks like you could use PDEP for this, e.g. as below. Note that with the unpacking mask 0x8080808080808080 each result byte is 0x80 or 0x00 rather than 0xFF or 0x00, which is still sufficient when only the byte MSBs are consumed (as with _mm256_blendv_epi8).

const uint64_t pmask = 0x8080808080808080ULL; // bit unpacking mask for PDEP

uint64_t amask0, amask1, amask2, amask3;

uint32_t mask;
__m256i vmask;

amask0 = _pdep_u64(mask, pmask); mask >>= 8;
amask1 = _pdep_u64(mask, pmask); mask >>= 8;
amask2 = _pdep_u64(mask, pmask); mask >>= 8;
amask3 = _pdep_u64(mask, pmask);
vmask = _mm256_set_epi64x(amask3, amask2, amask1, amask0);