Question

I have four 2-bit bitfields stored in a single byte. Each bitfield can thus represent 0, 1, 2, or 3. For example, here are the 4 possible values where the first 3 bitfields are zero:

00 00 00 00 = 0 0 0 0
00 00 00 01 = 0 0 0 1
00 00 00 10 = 0 0 0 2
00 00 00 11 = 0 0 0 3

I'd like an efficient way to sum the four bitfields. For example:

11 10 01 00 = 3 + 2 + 1 + 0 = 6

An 8-bit lookup table on a modern Intel x64 CPU takes 4 cycles to return an answer from L1. It seems like there should be some way to calculate the answer faster than that. 3 cycles gives room for about 6-12 simple bit operations. As a starter, the straightforward mask and shift looks like it will take 5 cycles on Sandy Bridge:

Assuming the bit fields are: d c b a, and that mask is: 00 00 00 11

Clarifying with help from Ira: This presumes that a, b, c, and d are identical and have all been set to the initial byte. Oddly, I think I can do this for free. Since I can do 2 loads per cycle, instead of loading byte once, I can just load it four times: a and d on the first cycle, b and c on the second. The second two loads will be delayed one cycle, but I don't need them until the second cycle. The splits below represent how things should break into separate cycles.

a = *byte
d = *byte

b = *byte
c = *byte

latency

latency

a &= mask
d >>= 6

b >>= 2
c >>= 4
a += d

b &= mask
c &= mask

b += c

a += b
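
In plain C, that straightforward mask-and-shift approach looks roughly like this (a minimal sketch; the function name is mine, and the pairwise adds mirror the schedule above):

/* Straightforward mask-and-shift: extract each 2-bit field and add.
   'key' is the loaded byte; a is the low field, d the high field. */
static inline unsigned sum_fields_naive(unsigned key) {
    unsigned a =  key       & 3;   /* bits 0-1 */
    unsigned b = (key >> 2) & 3;   /* bits 2-3 */
    unsigned c = (key >> 4) & 3;   /* bits 4-5 */
    unsigned d =  key >> 6;        /* bits 6-7 (no mask needed for a byte) */
    return (a + d) + (b + c);      /* pair the adds to shorten the dependency chain */
}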

A different encoding for the bitfields to make the logic easier would actually be fine, so long as it fits within a single byte and maps one-to-one with this scheme somehow. Dropping to assembly is also fine. Current target is Sandy Bridge, but targeting Haswell or beyond is fine too.

Application and motivation: I'm trying to make an open-source variable-bit decompression routine go faster. Each bitfield represents the compressed length of each of the four integers that follow. I need the sum to know how many bytes I need to jump to get to the next group of four. The current loop takes 10 cycles, with 5 of that coming from the lookup I'm trying to avoid. Shaving off a cycle would be a ~10% improvement.

Edit:

Originally I said "8 cycles", but as Evgeny points out below, I was wrong: the only time an indirect load takes 4 cycles is when loading from the first 2K of system memory without using an index register. A correct list of latencies can be found in the Intel Architecture Optimization Manual, Section 2.12:

>   Data Type               (Base + Offset) > 2048    (Base + Offset) < 2048
>                           Base + Index [+ Offset]
>   Integer                        5 cycles                  4 cycles
>   MMX, SSE, 128-bit AVX          6 cycles                  5 cycles
>   X87                            7 cycles                  6 cycles
>   256-bit AVX                    7 cycles                  7 cycles

Edit:

I think this is how Ira's solution below would break into cycles. I think it also takes 5 cycles of work post load.

a = *byte
b = *byte

latency

latency 

latency

a &= 0x33
b >>= 2

b &= 0x33
c = a

a += b
c += b

a &= 7
c >>= 4

a += c 

Solution 2

Other answers propose various means to add together the values sitting in a single variable (without unpacking them). While these approaches give pretty good throughput (POPCNT in particular), they have large latency, either because of long computation chains or because of high-latency instructions.

It may be better to use normal addition instructions (adding one pair of values at a time), simple operations like masks and shifts to split the values from one another, and instruction-level parallelism to do this efficiently. Also, the position of the two middle values in the byte suggests a variant of table lookup that uses a single 64-bit register instead of memory. All this allows the sum of the four fields to be calculated in only 4 or 5 clocks.

The original table-lookup approach suggested in the OP may consist of the following steps:

  1. load byte with four values from memory (5 clocks)
  2. calculate sum of the values using lookup table (5 clocks)
  3. update pointer (1 clock)

64-bit register lookup

The following snippet shows how to perform step #2 in 5 clocks, and also how to combine steps #2 and #3 while keeping the latency at 5 clocks (which could be optimized to 4 clocks with a complex addressing mode for the memory load):

p += 5 + (*p & 3) + (*p >> 6) +
  ((0x6543543243213210ull >> (*p & 0x3C)) & 0xF);

Here the constant "5" means that we skip the current byte with the lengths as well as the 4 data bytes corresponding to all-zero lengths. This snippet corresponds to the following code (64-bit only):

mov eax, 3Ch
and eax, ebx              ;clock 1
mov ecx, 3
and ecx, ebx              ;clock 1
shr ebx, 6                ;clock 1
add ebx, ecx              ;clock 2
mov rcx, 6543543243213210h
shr rcx, eax              ;clock 2..3
and ecx, 0Fh              ;clock 4
add rsi, 5
add rsi, rbx              ;clock 3 or 4
movzx ebx, byte ptr [rsi + rcx]    ;clock 5..9
add rsi, rcx

I tried to produce this code automatically with the following compilers: gcc 4.6.3, clang 3.0, icc 12.1.0. The first two didn't do anything good, but Intel's compiler did the work almost perfectly.
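
The 64-bit constant is just a 16-entry nibble table: entry i holds the sum of the two 2-bit halves of i, and the index supplied to it is the pair of middle bitfields, already scaled by 4 thanks to the 0x3C mask. A small self-check sketch (mine, not from the answer) makes this concrete:

#include <assert.h>
#include <stdint.h>

/* Verify that the packed-nibble constant maps the two middle 2-bit
   fields of every possible key byte to their sum. */
static void check_nibble_table(void) {
    const uint64_t table = 0x6543543243213210ull;
    for (unsigned key = 0; key < 256; key++) {
        unsigned b = (key >> 2) & 3, c = (key >> 4) & 3;   /* middle fields */
        unsigned looked_up = (unsigned)((table >> (key & 0x3C)) & 0xF);
        assert(looked_up == b + c);
    }
}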


Fast bitfield extraction with ROR instruction

Edit: Nathan's tests reveal a problem with the following approach. The ROR instruction on Sandy Bridge uses two ports and conflicts with the SHR instruction, so this code needs 1 more clock on Sandy Bridge, which makes it not very useful. It would probably work as expected on Ivy Bridge and Haswell.

It is not necessary to use the trick with a 64-bit register as a lookup table. Instead you could just rotate the byte by 4 bits, which places the two middle values in the positions of the first and fourth values. Then you can process them the same way. This approach has at least one disadvantage: it is not so easy to express byte rotation in C. Also, I'm not quite sure about this rotation, because on older processors it may result in a partial register stall. The optimization manual hints that for Sandy Bridge we can update part of the register without the stall if the operation source is the same as the destination, but I'm not sure I understood it properly, and I have no proper hardware to check. Anyway, here is the code (now it may be either 32-bit or 64-bit):

mov ecx, 3
and ecx, ebx              ;clock 1
shr ebx, 6                ;clock 1
add ebx, ecx              ;clock 2
ror al, 4                 ;clock 1
mov ecx, 3
and ecx, eax              ;clock 2
shr eax, 6                ;clock 2
add eax, ecx              ;clock 3
add esi, 5
add esi, ebx              ;clock 3
movzx ebx, [esi+eax]      ;clocks 4 .. 8
movzx eax, [esi+eax]      ;clocks 4 .. 8
add esi, eax
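
If you do want to write this variant in C, the rotation can be expressed with the usual shift/or idiom, which compilers generally recognize as a byte rotate (a sketch under that assumption; names are mine):

#include <stdint.h>

/* Rotate the key by 4 bits so the middle fields land in the outer
   positions, then handle both bytes the same way. */
static inline unsigned sum_fields_ror(uint8_t key) {
    uint8_t rot = (uint8_t)((key >> 4) | (key << 4));  /* ROR-by-4 idiom */
    unsigned outer = (key & 3) + (key >> 6);           /* fields 1 and 4 */
    unsigned inner = (rot & 3) + (rot >> 6);           /* fields 3 and 2 */
    return outer + inner;
}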

Using boundary between AL and AH to unpack bitfields

This method differs from the previous one only in the way the two middle bitfields are extracted. Instead of ROR, which is expensive on Sandy Bridge, a simple shift is used. This shift positions the second bitfield in register AL and the third bitfield in AH; then they are extracted with shifts/masks. As in the previous method, there are possibilities for a partial register stall here, now in two instructions instead of one, but it's very likely Sandy Bridge and newer processors can execute them without delay.

mov ecx, 3
and ecx, ebx              ;clock 1
shr ebx, 6                ;clock 1
add ebx, ecx              ;clock 2
shl eax, 4                ;clock 1
mov edx, 3
and dl, ah                ;clock 2
shr al, 6                 ;clock 2
add dl, al                ;clock 3
add esi, 5
add esi, ebx              ;clock 3
movzx ebx, [esi+edx]      ;clock 4..8
movzx eax, [esi+edx]      ;clock 4..8
add esi, edx

Load and compute sum in parallel

It is also not necessary to load the byte with the four lengths and calculate the sum sequentially: you can do these operations in parallel. There are only 13 possible values for the sum of the four fields. If your data is compressible, you'll rarely see this sum greater than 7. This means that instead of loading a single byte, you can load the 8 most likely key bytes into a 64-bit register, and you can do it earlier than computing the sum of the four. The 8 values are loaded while the sum is calculated; then you just extract the proper value from this register with a shift and mask. This idea may be used together with any means of computing the sum. Here it is used with a simple table lookup:

typedef unsigned long long ull;
ull four_lengths = *p;
for (...)
{
  ull preload = *((ull*)(p + 5));     /* speculative load of the 8 most likely key bytes */
  unsigned sum = table[four_lengths];
  p += 5 + sum;

  if (sum > 7)                        /* key byte is beyond the preloaded window */
    four_lengths = *p;
  else
    four_lengths = (preload >> (sum*8)) & 0xFF;
}

With proper assembly code this adds only 2 clocks to the latency: a shift and a mask. That gives 7 clocks (but only on compressible data).

If you change the table lookup to computations, you may get a loop latency of only 6 clocks: 4 to add the values together and update the pointer, and 2 for the shift and mask. Interestingly, in this case the loop latency is determined only by the computations and does not depend on the memory load latency.


Load and compute sum in parallel (deterministic approach)

Performing the load and summation in parallel can also be done in a deterministic way. Loading two 64-bit registers and then selecting one of them with CMP+CMOV is one possibility, but it does not improve performance over sequential computation. The other possibility is using 128-bit registers and AVX. Migrating data between 128-bit registers and GPR/memory adds a significant amount of latency (but half of this latency may be removed if we process two data blocks per iteration). Also, we'll need to use byte-aligned (i.e. unaligned) memory loads into the AVX registers (which also adds to the loop latency).

The idea is to perform all computations in AVX except for the memory load, which should be done with a GPR address. (There is an alternative of doing everything in AVX and using broadcast+add+gather on Haswell, but it's unlikely to be faster.) It should also help to alternate data loads between a pair of AVX registers (to process two data blocks per iteration). This allows pairs of load operations to overlap partially and cancels out half of the additional latency.

Start by unpacking the proper byte from the register:

vpshufb xmm0, xmm6, xmm0      ; clock 1

Add together four bitfields:

vpand xmm1, xmm0, [mask_12]   ; clock 2 -- bitfields 1,2 ready
vpand xmm2, xmm0, [mask_34]   ; clock 2 -- bitfields 3,4 (shifted)
vpsrlq xmm2, xmm2, 4          ; clock 3 -- bitfields 3,4 ready
vpshufb xmm1, xmm5, xmm1      ; clock 3 -- sum of bitfields 1 and 2
vpshufb xmm2, xmm5, xmm2      ; clock 4 -- sum of bitfields 3 and 4
vpaddb xmm0, xmm1, xmm2       ; clock 5 -- sum of all bitfields

Then update the address and load the next vector of bytes:

vpaddd xmm4, xmm4, [min_size]
vpaddd xmm4, xmm4, xmm1       ; clock 4 -- address + 5 + bitfields 1,2
vmovd esi, xmm4               ; clock 5..6
vmovd edx, xmm2               ; clock 5..6
vmovdqu xmm6, [esi + edx]     ; clock 7..12

Then repeat the same code once more, only using xmm7 instead of xmm6. While xmm6 is being loaded we can process xmm7.

This code uses several constants:

min_size = 5, 0, 0, ...
mask_12 = 0x0F, 0, 0, ...
mask_34 = 0xF0, 0, 0, ...
xmm5 = lookup table to add together two 2-bit values
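
In intrinsics form, the summation block above might look like the following sketch (SSE intrinsic names for the VEX-encoded instructions; it assumes only byte 0 of each register matters, as with the constants listed, and the identifiers are mine):

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Sum the four 2-bit fields of the key byte sitting in lane 0 of 'key'. */
static inline __m128i sum_four_fields(__m128i key /* xmm0 */) {
    /* xmm5: entry i = sum of the two 2-bit halves of nibble i */
    const __m128i sum2   = _mm_setr_epi8(0,1,2,3, 1,2,3,4, 2,3,4,5, 3,4,5,6);
    const __m128i mask12 = _mm_cvtsi32_si128(0x0F);   /* low nibble: fields 1,2  */
    const __m128i mask34 = _mm_cvtsi32_si128(0xF0);   /* high nibble: fields 3,4 */

    __m128i lo  = _mm_and_si128(key, mask12);
    __m128i hi  = _mm_srli_epi64(_mm_and_si128(key, mask34), 4);
    __m128i s12 = _mm_shuffle_epi8(sum2, lo);   /* sum of fields 1 and 2 */
    __m128i s34 = _mm_shuffle_epi8(sum2, hi);   /* sum of fields 3 and 4 */
    return _mm_add_epi8(s12, s34);              /* sum of all four, in lane 0 */
}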

A loop implemented as described here needs 12 clocks to complete and 'jumps' two data blocks at once, which means 6 cycles per data block. This number may be too optimistic: I'm not quite sure that MOVD needs only 2 clocks, and it's not clear what the latency of a MOVDQU instruction performing an unaligned memory load is. I suspect MOVDQU has very high latency when the data crosses a cache-line boundary, which probably means something like 1 additional clock of latency on average. So about 7 cycles per data block is a more realistic estimate.


Using brute force

Jumping just one or two data blocks per iteration is convenient but does not fully use the resources of modern processors. After some pre-processing we may implement jumping straight to the first data block in the next aligned 16 bytes of data. The pre-processing should read the data, compute the sum of the four fields for each byte, use this sum to compute "links" to the next four-field bytes, and finally follow these "links" up to the next aligned 16-byte block. All these computations are independent and may be performed in any order using the SSE/AVX instruction set. AVX2 would do the pre-processing two times faster.

  1. Load a 16- or 32-byte data block with MOVDQA.
  2. Add together the 4 bitfields of each byte (sketched in code after this list). To do this, extract the high and low 4-bit nibbles with two PAND instructions, shift the high nibble with PSRL*, find the sum of each nibble with two PSHUFBs, and add the two sums with PADDB. (6 uops)
  3. Use PADDB to compute "links" to the next four-field bytes: add constants 0x75, 0x76, ... to the bytes of XMM/YMM register. (1 uop)
  4. Follow the "links" with PSHUFB and PMAXUB (a more expensive alternative to PMAXUB is the combination of PCMPGTB and PBLENDVB). VPSHUFB ymm1, ymm2, ymm2 does almost all the work. It substitutes "out-of-bound" values with zero. Then VPMAXUB ymm2, ymm1, ymm2 restores the original "links" in place of these zeros. Two iterations are enough. After each iteration the distance for each "link" is twice as large, so we need only log(longest_chain_length) iterations. For example, the longest chain 0->5->10->15->X will be compressed to 0->10->X after one step and to 0->X after two steps. (4 uops)
  5. Subtract 16 from each byte with PSUBB, and (for AVX2 only) extract high 128 bits to a separate XMM register with VEXTRACTI128. (2 uops)
  6. Now pre-processing is finished. We may follow the "links" to the first data block in the next 16-byte piece of data. This might be done with PCMPGTB, PSHUFB, PSUBB, and PBLENDVB. But if we assign the range 0x70 .. 0x80 to the possible "link" values, a single PSHUFB will do the work properly (actually a pair of PSHUFBs, in the case of AVX2). Values 0x70 .. 0x7F select the proper byte from the next 16-byte register, while the value 0x80 skips the next 16 bytes and loads byte 0, which is exactly what is needed. (2 uops, latency = 2 clocks)
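
A sketch of step 2 in intrinsics (16 bytes per call; the constant and function names are mine, and the uop mix differs slightly from the count above because the 0x0F mask is reused):

#include <tmmintrin.h>   /* SSSE3 */

/* Sum the four 2-bit fields of every byte in a 16-byte block at once. */
static inline __m128i sum_fields_per_byte(__m128i block) {
    const __m128i lo_nib = _mm_set1_epi8(0x0F);
    const __m128i sum2   = _mm_setr_epi8(0,1,2,3, 1,2,3,4, 2,3,4,5, 3,4,5,6);

    __m128i lo   = _mm_and_si128(block, lo_nib);                     /* PAND        */
    __m128i hi   = _mm_and_si128(_mm_srli_epi16(block, 4), lo_nib);  /* PSRLW, PAND */
    __m128i s_lo = _mm_shuffle_epi8(sum2, lo);   /* per-byte sum of fields 1,2 */
    __m128i s_hi = _mm_shuffle_epi8(sum2, hi);   /* per-byte sum of fields 3,4 */
    return _mm_add_epi8(s_lo, s_hi);             /* PADDB: 0..12 in each byte  */
}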

Instructions for these 6 steps do not need to be ordered sequentially. For example, instructions for steps 5 and 2 may stand next to each other. Instructions for each step should process 16/32-byte blocks at different stages of the pipeline, like this: step 1 processes block i, step 2 processes block i-1, steps 3 and 4 process block i-2, etc.

The whole loop's latency might be 2 clocks (per 32 bytes of data), but the limiting factor here is throughput, not latency. When AVX2 is used we need to execute 15 uops, which means 5 clocks. If the data is not compressible and data blocks are large, this gives about 3 clocks per data block. If the data is compressible and data blocks are small, this gives about 1 clock per data block. (But since MOVDQA latency is 6 clocks, to get 5 clocks per 32 bytes we need two overlapping loads, processing twice as much data in each loop.)

The pre-processing steps are independent of step #6, so they may be performed in different threads. This may decrease the time per 32 bytes of data below 5 clocks.

OTHER TIPS

Would a builtin POPCOUNT instruction help?

n  = POPCOUNT(byte & 0x55);
n += 2 * POPCOUNT(byte & 0xAA);

Or maybe

  word = byte + ((byte&0xAA) << 8);
  n = POPCOUNT(word);

Not sure about the total timing. This discussion says popcount has 3 cycles latency, 1 throughput.
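
Each 2-bit field equals its low bit plus twice its high bit, so counting the low bits once and the high bits twice gives the total. A hedged C sketch of the first variant (the UPDATE below folds the same identity into the full loop):

#include <stdint.h>

/* POPCNT-based field sum; needs -mpopcnt (or -msse4, as used below) so
   that __builtin_popcount compiles to a single POPCNT. */
static inline unsigned sum_fields_popcnt(uint8_t key) {
    return __builtin_popcount(key & 0x55) + 2 * __builtin_popcount(key & 0xAA);
}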


UPDATE:
I may be missing some important fact about how to run IACA, but after a few experiments in the 12-11 throughput range, I compiled the following:

uint32_t decodeFast(uint8_t *in, size_t count) {
  uint64_t key1 = *in;
  uint64_t key2;
  size_t adv;
  while (count--) {
     IACA_START;
     key2 = key1 & 0xAA;
     in += __builtin_popcount(key1);
     adv = __builtin_popcount(key2);
     in += adv + 4;
     key1 = *in;
  }
  IACA_END;
  return key1;
}

with gcc -std=c99 -msse4 -m64 -O3 test.c

and got 3.55 cycles !?!:

Block Throughput: 3.55 Cycles       Throughput Bottleneck: InterIteration
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    |           | 1.0 |           |           |     |     |    | popcnt edx,eax
|   1    | 0.9       |     |           |           |     | 0.1 | CP | and eax,0x55 
|   1    |           | 1.0 |           |           |     |     | CP | popcnt eax,eax
|   1    | 0.8       |     |           |           |     | 0.2 |    | movsxd rdx,edx
|   1    | 0.6       |     |           |           |     | 0.4 |    | add rdi, rdx
|   1    | 0.1       | 0.1 |           |           |     | 0.9 | CP | cdqe 
|   1    | 0.2       | 0.3 |           |           |     | 0.6 |    | sub rsi, 1
|   1    | 0.2       | 0.8 |           |           |     |     | CP | lea rdi,[rdi+rax+4] 
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     | CP | movzx eax,[rdi]
|   1    |           |     |           |           |     | 1.0 |    | jnz 0xffff

Two more ideas

A possible micro-optimization to do the sum in 2 instructions:

total = 0;
PDEP(vals, 0x03030303, *in);  # expands the 2-bit fields into bytes
PSADBW(total, vals)           # total := sum of abs(0 - byte) for each byte in vals

The latency of each is supposedly 3, so this may not help. Maybe the byte-wise summation could be replaced with simple shifts and adds along the lines of AX = total + (total >> 16); ADD AL, AH.
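
In intrinsics that idea might look like this sketch (PDEP requires BMI2, i.e. Haswell or later and -mbmi2; names are mine):

#include <stdint.h>
#include <immintrin.h>   /* BMI2: _pdep_u32; SSE2: _mm_sad_epu8 */

/* Expand the four 2-bit fields into four bytes with PDEP, then let
   PSADBW do the horizontal byte sum against zero. */
static inline unsigned sum_fields_pdep(uint8_t key) {
    uint32_t bytes = _pdep_u32(key, 0x03030303u);      /* one field per byte */
    __m128i v = _mm_cvtsi32_si128((int)bytes);
    __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());  /* sums the low 8 bytes */
    return (unsigned)_mm_cvtsi128_si32(s);
}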

A macro-optimization:
You mention using the key as a lookup into a table of shuffle instructions. Why not just store the distance to the next key along with the shuffle instruction? Either store a bigger table, or possibly squeeze the 4-bit length into unused bits 3-6 of the shuffle key, at the expense of needing a mask to extract it.

Consider

 temp = (byte & 0x33) + ((byte >> 2) & 0x33);  /* low nibble: fields 1+2, high nibble: fields 3+4 */
 sum = (temp & 7) + (temp >> 4);               /* add the two partial sums */

Should be 9 machine instructions, many of them executed in parallel. (OP's first try is 9 instructions plus some moves not mentioned).

On inspection, this seems to have too many serial dependencies to be a win.

EDIT: The discussion about binary ops being destructive, and LEA avoiding this, got me thinking about how to use LEA to combine more than one operand, and multiplication by constants. The code above tries to right-normalize the answer by shifting right, but we can left-normalize the answer by multiplying. With that insight, this code might work:

     mov     ebx,  byte      ; ~1: gotta start somewhere
     mov     ecx, ebx        ; ~2: = byte
     and     ebx, 0xCC       ; ~3: 2 sets of 2 bits, with zeroed holes
     and     ecx, 0x33       ; ~3: complementary pair of bits
     lea     edx, [ebx+4*ecx] ; ~4: sum bit pairs, forming 2 partial sums (bits 2-4 and 6-8)
     lea     edi, [8*edx+edx] ; ~5: need 16*(lower bits of edx)
     lea     edi, [8*edx+edi] ; ~6: edi = 17*edx; total sum now sits in bits 6-9
     shr     edi, 6           ; ~7: right normalized
     and     edi, 0x0F        ; ~8: masked

Well, entertaining but still didn't pan out. 3 clocks isn't very long :-{

I have no idea how many cycles it could take, and I might be completely off, but it is possible to sum with 5 simple operations using 32-bit multiplications:

unsigned int sum = ((((byte * 0x10101) & 0xC30C3) * 0x41041) >> 18) & 0xF;

The first multiplication repeats the bit pattern

abcdefgh -> abcdefghabcdefghabcdefgh

The first bitand keeps a pair every 6 bits:

abcdefghabcdefghabcdefgh -> 0000ef0000cd0000ab0000gh

The second multiplication adds shifted copies of the bit pattern (only yyyy is of interest):

                     0000ef0000cd0000ab0000gh
             + 0000ef0000cd0000ab0000gh000000
       + 0000ef0000cd0000ab0000gh000000000000
 + 0000ef0000cd0000ab0000gh000000000000000000
 --------------------------------------------
   ..................00yyyy00................

The last 2 ops shift yyyy to the right and cut off the left part.

The main problem is that the ops are sequential, though...

EDIT

Or just shift the whole thing 10 bits to the left and remove the last bitand:

unsigned int sum = (((byte * 0x4040400) & 0x30C30C00) * 0x41041) >> 28;
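
Since there are only 256 possible key bytes, both forms are easy to check exhaustively against a naive sum (a quick test sketch, not part of the original answer):

#include <assert.h>

/* Exhaustive check of both multiplication tricks. */
static void check_multiply_tricks(void) {
    for (unsigned byte = 0; byte < 256; byte++) {
        unsigned naive = (byte & 3) + ((byte >> 2) & 3)
                       + ((byte >> 4) & 3) + (byte >> 6);
        unsigned v1 = ((((byte * 0x10101) & 0xC30C3) * 0x41041) >> 18) & 0xF;
        unsigned v2 = (((byte * 0x4040400) & 0x30C30C00) * 0x41041) >> 28;
        assert(v1 == naive && v2 == naive);
    }
}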

There are lots of great ideas here, but it's getting hard to find them amidst the discussion. Let's use this answer to offer final solutions along with their timing. Please feel free to edit this post and add your own along with timing. If unsure about the timing, paste the code in at the bottom and I'll measure it. x64 assembly would be best. I'll happily compile C, but rarely have good results at this level of optimization without lots of tweaking.

Overview

Rephrasing the question to put it into proper context: the goal is to rapidly decode an integer compression format known as "Varint-GB" (or Group Varint). Among other places, it's described in a paper by Daniel Lemire and Leo Boytsov. I made comments on the first version of this paper in the standard "clearly the author is an idiot" style, and Daniel (the main author of the paper, and not so much an idiot) cunningly roped me in to help code for a followup.

Standard Varint (aka VByte) has a flag at the beginning of each byte determining if it is the end of the integer, but this is slow to parse. This version has a single-byte 'key' and then 4 compressed integers of payload. The key consists of 4 2-bit fields, each representing the byte length of one of the compressed integers that follow. Each can be 1 byte (00), 2 bytes (01), 3 bytes (10), or 4 bytes (11). Each 'chunk' is therefore 5 to 17 bytes long, but always encodes the same number (4) of 32-bit unsigned integers.

Sample Chunk:
  Key:  01 01 10 00  
  Data: [A: two bytes] [B: two bytes] [C: three bytes] [D: one byte]
Decodes to: 00 00 AA AA   00 00 BB BB   00 CC CC CC  00 00 00 DD

The key is an index into a table of 16-byte shuffle patterns, and the actual decoding is done by shuffling the data bytes into the correct spacing with PSHUFB.

vec_t Data = *input
vec_t ShuffleKey = decodeTable[key]     
VEC_SHUFFLE(Data, ShuffleKey) // PSHUFB
*out = Data
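
With intrinsics, the decode step is roughly the following sketch (the identifier names and the exact decodeTable declaration are illustrative; the table itself holds the 16-byte shuffle patterns described above and is not shown):

#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

extern const uint8_t decodeTable[256][16];   /* precomputed shuffle patterns */

/* Decode one chunk: 'key' selects the shuffle that spreads the packed
   data bytes (which start right after the key byte) into four 32-bit slots. */
static inline void decode_chunk(const uint8_t *in, uint32_t *out, uint8_t key) {
    __m128i data    = _mm_loadu_si128((const __m128i *)(in + 1));  /* skip key byte */
    __m128i shuffle = _mm_loadu_si128((const __m128i *)decodeTable[key]);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(data, shuffle));
}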

In reality, there is usually a "delta decoding" step as well, since the original integers have generally been made smaller by compressing the "delta" (difference) between the integers rather than the integers themselves. But the latency of the decode routine usually doesn't matter, since the next iteration does not depend on it.

The Problem Restated

The problem specified here is to hop from one 'key' to the next. Since there are no dependencies on the decoded Data here (only on the key), I'll ignore the actual decoding and just concentrate on the loop that reads the keys. The function takes a pointer to a key and a count n, and returns the n'th key.

11 cycles

The 'basic' approach is to use a lookup table of 'advance' offsets with the key as index. Look up any of the 256 keys in offsetTable to get the precalculated (sum + 1) offset to advance. Add that to the current input position and read the next key. According to Intel's IACA, this loop takes 11 cycles on Sandy Bridge (the cycle counts below are for Sandy Bridge as well).

uint32_t decodeBasic(uint8_t *in, size_t count) {
    uint64_t key, advance;
    for (size_t i = count; i > 0; i--) {
        key = *in;
        advance = offsetTable[key];
        in += advance;
    }
    return key;
}

0000000000000000 <decodeBasic>:
   0:   test   %rsi,%rsi
   3:   je     19 <decodeBasic+0x19>
   5:   nopl   (%rax)
   8:   movzbl (%rdi),%eax
   b:   add    0x0(,%rax,8),%rdi
  13:   sub    $0x1,%rsi
  17:   jne    8 <decodeBasic+0x8>
  19:   repz retq 

Block Throughput: 11.00 Cycles       Throughput Bottleneck: InterIteration
|  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
--------------------------------------------------------------
|           |     | 1.0   1.0 |           |     |     | CP | movzx eax, byte ptr [rdi]
| 0.3       | 0.3 |           | 1.0   1.0 |     | 0.3 | CP | add rdi, qword ptr [rax*8]
|           |     |           |           |     | 1.0 |    | sub rsi, 0x1
|           |     |           |           |     |     |    | jnz 0xffffffffffffffe7
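
For reference, offsetTable can be precomputed directly from the field sums; a sketch consistent with the "p += 5 + sum" advance used in the question (the table in the actual project may be built differently):

#include <stdint.h>

/* offsetTable[key] = 1 key byte + 4 minimum data bytes + sum of the fields. */
static uint64_t offsetTable[256];

static void buildOffsetTable(void) {
    for (unsigned key = 0; key < 256; key++)
        offsetTable[key] = 5 + (key & 3) + ((key >> 2) & 3)
                             + ((key >> 4) & 3) + (key >> 6);
}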

10 cycles

From there, we can get down to 10 cycles by rearranging the loop so that we update the input pointer and start loading the next key at the same time. You may note that I had to use inline assembly to 'encourage' the compiler to produce the output I wanted. I'll also start dropping the outer loop, as it (usually) stays the same.

key = *in;
advance = offsetTable[key]; 
for (size_t i = count; i > 0; i--) {
    key = *(in + advance);
    ASM_LEA_ADD_BASE(in, advance);
    advance = offsetTable[key];
}

Block Throughput: 10.00 Cycles       Throughput Bottleneck: InterIteration
|  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
------------------------------------------------------------
|           |     | 1.0   1.0 |           |     |     | CP | movzx eax, byte ptr [rdi+rdx*1]
| 0.5       | 0.5 |           |           |     |     |    | lea rdi, ptr [rdi+rdx*1]
|           |     |           | 1.0   1.0 |     |     | CP | mov rdx, qword ptr [rax*8]
|           |     |           |           |     | 1.0 |    | sub rsi, 0x1
|           |     |           |           |     |     |    | jnz 0xffffffffffffffe2

9 cycles

I'd tried to use POPCNT earlier, but without the suggestions, hints, and ideas from Ira and AShelly I wasn't having much luck. But putting together the pieces, I think I have something that runs the loop in 9 cycles. I've put it into the actual decoder, and the number of Ints/s seems to agree with this. This loop is essentially in assembly, since I couldn't get the compiler to do what I wanted otherwise, much less across multiple compilers.

[Edit: removed extra MOV per comment by AShelly]

uint64_t key1 = *in;
uint64_t key2 = *in;
for (size_t i = count; i > 0; i--) {
    uint64_t advance1, advance2;
    ASM_POPCOUNT(advance1, key1);
    ASM_AND(key2, 0xAA);

    ASM_POPCOUNT(advance2, key2);
    in += advance1;

    ASM_MOVE_BYTE(key1, *(in + advance2 + 4));
    ASM_LOAD_BASE_OFFSET_INDEX_MUL(key2, in, 4, advance2, 1);        
    in += advance2;
 }


Block Throughput: 9.00 Cycles       Throughput Bottleneck: InterIteration
|  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
------------------------------------------------------------
|           | 1.0 |           |           |     |     | CP | popcnt r8, rax
| 1.0       |     |           |           |     |     | CP | and rdx, 0xaa
|           |     |           |           |     | 1.0 | CP | add r8, rdi
|           | 1.0 |           |           |     |     | CP | popcnt rcx, rdx
|           |     | 1.0   1.0 |           |     |     | CP | movzx rax, byte ptr [rcx+r8*1+0x4]
|           |     |           | 1.0   1.0 |     |     | CP | mov rdx, qword ptr [r8+rcx*1+0x4]
| 1.0       |     |           |           |     |     |    | lea rdi, ptr [rcx+r8*1]
|           |     |           |           |     | 1.0 |    | dec rsi
|           |     |           |           |     |     |    | jnz 0xffffffffffffffd0
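
For readers without the ASM_* wrappers, the same dataflow in plain C looks like the sketch below (the function name is mine; the compiler is free to schedule this differently, which is exactly why the inline-asm nudges were needed to hit 9 cycles):

#include <stdint.h>
#include <string.h>

uint32_t decodePlainC(uint8_t *in, size_t count) {
    uint64_t key1 = *in;
    uint64_t key2 = *in;
    for (size_t i = count; i > 0; i--) {
        uint64_t advance1 = (uint64_t)__builtin_popcountll(key1);  /* popcnt */
        key2 &= 0xAA;                                              /* and    */
        uint64_t advance2 = (uint64_t)__builtin_popcountll(key2);  /* popcnt */
        in += advance1;

        key1 = *(in + advance2 + 4);           /* movzx: next key byte */
        memcpy(&key2, in + advance2 + 4, 8);   /* unaligned qword load */
        in += advance2;
    }
    return (uint32_t)key1;
}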

As an indication of the intricacy of the moving parts in modern processors, I had an interesting experience with a variation of this routine. When I combined the second mov rax line with the and rax, 0xaa by giving the and a memory operand (mov rax, 0xAA; and rax, qword ptr [r8+rcx*1+0x4]), I ended up with a routine that vacillated 30% from run to run. I think this is because sometimes the initial conditions leading up to the loop would cause the 'and' micro-op of the load/and to run before the POPCNT for the entire loop.

8 cycles

Anyone?

Evgeny

This is my attempt to implement Evgeny's solution. I haven't been able to get it down to 9 cycles yet, at least for IACA's model of Sandy Bridge (which has been accurate so far). I think the issue is that while ROR has a latency of 1, it takes two micro-ops, on P1 or P5; to get a latency of 1, both have to be available. The others are just one micro-op, and thus always have a latency of 1. AND, ADD, and MOV can issue on P0, P1, or P5, but SHR cannot issue on P1. I can get closer to 10 cycles by adding some extra junk ops which prevent ADD and AND from displacing SHR or ROR, but I'm not sure how to get below 10.

Block Throughput: 10.55 Cycles       Throughput Bottleneck: InterIteration
|  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
------------------------------------------------------------
|           |     | 1.0   1.0 |           |     |     | CP | movzx eax, byte ptr [esi+0x5]
|           |     |           | 1.0   1.0 |     |     | CP | movzx ebx, byte ptr [esi+0x5]
| 0.2       | 0.6 |           |           |     | 0.3 |    | add esi, 0x5
| 0.3       | 0.3 |           |           |     | 0.3 |    | mov ecx, 0x3
| 0.2       | 0.2 |           |           |     | 0.6 |    | mov edx, 0x3
| 1.4       |     |           |           |     | 0.6 | CP | ror al, 0x4
| 0.1       | 0.7 |           |           |     | 0.2 | CP | and ecx, ebx
| 0.6       |     |           |           |     | 0.4 | CP | shr ebx, 0x6
| 0.1       | 0.7 |           |           |     | 0.2 | CP | add ebx, ecx
| 0.3       | 0.4 |           |           |     | 0.3 | CP | and edx, eax
| 0.6       |     |           |           |     | 0.3 | CP | shr eax, 0x6
| 0.1       | 0.7 |           |           |     | 0.2 | CP | add eax, edx
| 0.3       | 0.3 |           |           |     | 0.3 | CP | add esi, ebx
| 0.2       | 0.2 |           |           |     | 0.6 | CP | add esi, eax
  mov al,1
  mov ah,2
  mov bl,3
  mov bh,4
  add ax,bx
  add al,ah