Question

I have a 64-bit number (only the 42 low-order bits are used) and need to compute the sum of the 4 bits at positions n, n+m, n+m*2 and n+m*3 (note: anything that would produce a sum > 4 is invalid) for some fixed m and every value of n that keeps all four bits inside the number.

As an example, using m=3 and given the 16-bit number

0010 1011 0110 0001

I need to compute (one sum per valid n, listed here from the highest n down to n = 0)

2, 3, 1, 2, 3, 0, 3

Does anyone have any (cool) ideas for ways to do this? I'm fine with bit twiddling.
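
For reference, a minimal brute-force sketch of what I'm computing (the function and parameter names are just for illustration):

#include <cstdint>

// one 0..4 sum per window start n
void windowSums(uint64_t in, int m, int bits, int out[])
{
    for (int n = 0; n + 3 * m < bits; ++n)
        out[n] = (int)(in >> n           & 1)
               + (int)(in >> (n + m)     & 1)
               + (int)(in >> (n + 2 * m) & 1)
               + (int)(in >> (n + 3 * m) & 1);
}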


My current thought is to make bit-shifted copies of the input to align the bits to be summed, and then build a logic tree that works as a 4-input 1-bit adder.

// shifted copies align the four bits of each window (for m = 3) at the same
// bit position; the sum for the window starting at n lands at bit n+9 below
v1 = In;
v2 = In << 3;
v3 = In << 6;
v4 = In << 9;

a1 = v1 ^ v2;   // sum bit of v1 + v2
a2 = v1 & v2;   // carry bit of v1 + v2
b1 = v3 ^ v4;   // sum bit of v3 + v4
b2 = v3 & v4;   // carry bit of v3 + v4
c2 = a1 & b1;   // carry out of adding the two sum bits
d2 = a2 ^ b2;   // sum of the two carry bits

o1 = a1 ^ b1;   // bit 0 of the result
o2 = c2 ^ d2;   // bit 1 of the result
o4 = a2 & b2;   // bit 2 of the result (c2 and a2|b2 are never set together, so no extra carry term is needed)

This does end up with the bits of the result spread across 3 different ints but oh well.

Edit: as it happens I need the histogram of the sums, so doing a bit-count of o4, o2&o1, o2 and o1 gives me what I want.
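
Concretely, a minimal sketch of that histogram step (assuming m=3 and the left-shift alignment above, so only bit positions 9..41 hold valid results; std::popcount needs C++20, a compiler builtin would do as well):

#include <bit>       // std::popcount (C++20)
#include <cstdint>

void histogram(uint64_t o1, uint64_t o2, uint64_t o4, int hist[5])
{
    const uint64_t valid = ((1ull << 42) - 1) & ~((1ull << 9) - 1);  // bits 9..41
    o1 &= valid; o2 &= valid; o4 &= valid;

    hist[4] = std::popcount(o4);
    hist[3] = std::popcount(o2 & o1);
    hist[2] = std::popcount(o2) - hist[3];   // o2 is set for sums 2 and 3
    hist[1] = std::popcount(o1) - hist[3];   // o1 is set for sums 1 and 3
    hist[0] = 33 - hist[4] - hist[3] - hist[2] - hist[1];  // 33 valid windows
}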


A second solution uses a perfect hash function:

int arr[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};  // arr[i] = popcount(i)

for (int i = 0; i < N; i++)      // N = number of valid windows
{
   out[i] = arr[(In & 0b1001001001) % 30];  // pick bits i, i+3, i+6, i+9 and hash them into 0..15
   In >>= 1;
}

This works by noting that the 4 selected bits can only take on 16 patterns and that (by guess and check) they can be hashed into 0-15 using mod 30. From there, a table of computed values gives the needed sum. As it happens only 3 of the 4 strides I need work this way.
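
A quick way to see why the hash is perfect: the four selected bits contribute 1, 8, 64 and 512, which reduce to 1, 8, 4 and 2 mod 30 (four distinct powers of two), so the 16 patterns map to 16 distinct indices below 16 and the table entry is exactly the bit count. A throwaway check along those lines (just a sketch):

#include <cassert>
#include <set>

int main()
{
    const int arr[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    std::set<unsigned> seen;
    for (unsigned b = 0; b < 16; ++b)
    {
        // place the 4 pattern bits at positions 0, 3, 6, 9
        unsigned pattern = (b & 1) | (b & 2) << 2 | (b & 4) << 4 | (b & 8) << 6;
        unsigned h = pattern % 30;               // 1, 8, 64, 512 -> 1, 8, 4, 2
        assert(h < 16 && seen.insert(h).second); // no collisions
        int sum = (b & 1) + (b >> 1 & 1) + (b >> 2 & 1) + (b >> 3 & 1);
        assert(arr[h] == sum);                   // table entry = bit count
    }
}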


p.s.

Correct trumps fast. Fast trumps clear. I expect to be running this millions of times.


Solution

Maybe I am crazy, but I am having fun :D This solution is based on data parallelism: it fakes a vector CPU inside a plain 64-bit register without actually using SSE intrinsics or anything similar. Each 16-bit lane of the register holds one window, and since a window sum is at most 4 it can never overflow into the neighbouring lane.

unsigned short out[64];
const unsigned long long mask      = 0x0249024902490249ul;
const unsigned long long shiftmask = 0x0001000100010001ul;

unsigned long long t = (unsigned short)(in >> 38) | (unsigned long long)(unsigned short)(in >> 39) << 16 | (unsigned long long)(unsigned short)(in >> 40) << 32 | (unsigned long long)(unsigned short)(in >> 41) << 48;
t &= mask;
*((unsigned long long*)(out + 38)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);

[... snipsnap ...]

t = (unsigned short)(in >> 2) | (unsigned long long)(unsigned short)(in >> 3) << 16 | (unsigned long long)(unsigned short)(in >> 4) << 32 | (unsigned long long)(unsigned short)(in >> 5) << 48;
t &= mask;
*((unsigned long long*)(out + 2)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);

t = (unsigned short)in | (unsigned long long)(unsigned short)(in >> 1) << 16;
t &= mask;
*((unsigned int*)out) = (unsigned int)((t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask));


By reordering the computations, we can further reduce the execution time significantly, since it drastically cuts the number of times values have to be packed into the QWORD: each packed QWORD is reused for several output positions simply by shifting it right one bit at a time. A few other optimizations are quite obvious and rather minor, but they add up to another interesting speedup.

unsigned short out[64];
const unsigned long long Xmask = 0x249024902490249ull;
const unsigned long long Ymask = 0x7000700070007u;

unsigned long long x = (in >> 14 & 0xFFFFu) | (in >> 20 & 0xFFFFu) << 16 | (in >> 26 & 0xFFFFu) << 32 | (in >> 32) << 48;
unsigned long long y;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[32] = (unsigned short)(y >> 48);
out[26] = (unsigned short)(y >> 32);
out[20] = (unsigned short)(y >> 16);
out[14] = (unsigned short)(y      );

x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[33] = (unsigned short)(y >> 48);
out[27] = (unsigned short)(y >> 32);
out[21] = (unsigned short)(y >> 16);
out[15] = (unsigned short)(y      );

[snipsnap]

x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[37] = (unsigned short)(y >> 48);
out[31] = (unsigned short)(y >> 32);
out[25] = (unsigned short)(y >> 16);
out[19] = (unsigned short)(y      );

x >>= 1;
x &= 0xFFFF000000000000ul;
x |= (in & 0xFFFFu) | (in >> 5 & 0xFFFFu) << 16 | (in >> 10 & 0xFFFFu) << 32;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[38] = (unsigned short)(y >> 48);
out[10] = (unsigned short)(y >> 32);
out[ 5] = (unsigned short)(y >> 16);
out[ 0] = (unsigned short)(y      );

[snipsnap]

x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[ 9] = (unsigned short)(y >> 16);
out[ 4] = (unsigned short)(y      );

Running times for 50 million executions in native C++ (all outputs verified to match ^^), compiled as a 64-bit binary on my PC:
Array-based solution: ~5700 ms
Naive hardcoded solution: ~4200 ms
The first solution: ~2400 ms
The second solution: ~1600 ms

OTHER TIPS

A suggestion that I don't want to code right now is to use a loop, an array to hold partial results, and constants to pick up the bits m at a time.

loop
   s[3*i]   += (x >> 0) & 1;
   s[3*i+1] += (x >> 1) & 1;
   s[3*i+2] += (x >> 2) & 1;
   x >>= 3;

This will pick up too many bits in each sum. But you can also keep track of the intermediate results and subtract from the sums as you go, to account for bits that are no longer part of the window.

loop
   s[3*i]   += p[3*i]   = (x >> 0) & 1;
   s[3*i+1] += p[3*i+1] = (x >> 1) & 1;
   s[3*i+2] += p[3*i+2] = (x >> 2) & 1;

   s[3*i]   -= p[3*i-10];
   s[3*i+1] -= p[3*i-9];
   s[3*i+2] -= p[3*i-8];
   x >>= 3;

with the appropriate bounds checking, of course.

The fastest approach is to just hardcode the sums themselves.

s[0] = ((x & (1<<0)) != 0) + ((x & (1<<3)) != 0) + ((x & (1<<6)) != 0) + ((x & (1<<9)) != 0);

etc. (The shifts occur at compile time.)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow