Question

I need to perform a many operations of finding remainders of the division unsigned long long number by the 16-bit modulus:

unsigned long long largeNumber;
long residues[100];
unsigned long modules[100];
intiModules(modules); //set different 16-bit values

for(int i = 0; i < 100; i++){
     residues[i] = largeNumber % modules[i];
}

How I can accelerate this loop?

The iteration count is not large (32-128), but this loop performed very often so its speed is critical.

Was it helpful?

Solution

Division by a constant (and there are only 65536 of them) can be performed by multiplication of the reciprocal followed/preceded by some fine-tuning. Since this method is accurate for a limited range, one can use some techniques to reduce the 64-bit operand to a much smaller value (which is still congruent to the original value):

// pseudo code -- not c
a = 0x1234567890abcdefULL;
a = 0x1234 << 48 + 0x5678 << 32 + 0x90ab << 16 + 0xcdef;

a % N === ((0x1234 * (2^48 % N) +     // === means 'is congruent'
           (0x5678 * (2^32 % N)) +    // ^ means exponentation
           (0x90ab * (2^16 % N)) + 
           (0xcdef * 1)) % N;

The intermediate value can be calculated with (small) multiplications only and the final remainder (%N) has potential to be calculated with reciprocal multiplication.

OTHER TIPS

If speed is critical, according to this answer about branch prediction and this one, loop unrolling may be of help, avoiding the test induced by the for instruction, reducing the number of tests and improving "branch prediction".

The gain (or none, some compilers do that optimization for you) varies based on architecture / compiler.

On my machine, changing the loop while preserving the number of operations from

for(int i = 0; i < 500000000; i++){
    residues[i % 100] = largeNumber % modules[i % 100];
}

to

for(int i = 0; i < 500000000; i+=5){
    residues[(i+0) % 100] = largeNumber % modules[(i+0) % 100];
    residues[(i+1) % 100] = largeNumber % modules[(i+1) % 100];
    residues[(i+2) % 100] = largeNumber % modules[(i+2) % 100];
    residues[(i+3) % 100] = largeNumber % modules[(i+3) % 100];
    residues[(i+4) % 100] = largeNumber % modules[(i+4) % 100];
}

with gcc -O2 the gain is ~15%. (500000000 instead of 100 to observe a more significant time difference)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top