我需要通过16位模量来执行多个查找分割生成的剩余数的操作:

unsigned long long largeNumber;
long residues[100];
unsigned long modules[100];
intiModules(modules); //set different 16-bit values

for(int i = 0; i < 100; i++){
     residues[i] = largeNumber % modules[i];
}
.

如何加速这个循环?

迭代计数不大(32-128),但这种循环经常执行,因此其速度至关重要。

有帮助吗?

解决方案

通过常数(并且它们只有65536),可以通过乘法来执行互动次序/之前的微调。由于该方法对于有限的范围准确,因此可以使用一些技术来将64位操作数减少到更小的值(其仍然是原始值的一致):

// pseudo code -- not c
a = 0x1234567890abcdefULL;
a = 0x1234 << 48 + 0x5678 << 32 + 0x90ab << 16 + 0xcdef;

a % N === ((0x1234 * (2^48 % N) +     // === means 'is congruent'
           (0x5678 * (2^32 % N)) +    // ^ means exponentation
           (0x90ab * (2^16 % N)) + 
           (0xcdef * 1)) % N;
.

可以用(小)乘法计算中间值,并且最终剩余部分(%n)具有互易乘法的可能性。

其他提示

If speed is critical, according to this answer about branch prediction and this one, loop unrolling may be of help, avoiding the test induced by the for instruction, reducing the number of tests and improving "branch prediction".

The gain (or none, some compilers do that optimization for you) varies based on architecture / compiler.

On my machine, changing the loop while preserving the number of operations from

for(int i = 0; i < 500000000; i++){
    residues[i % 100] = largeNumber % modules[i % 100];
}

to

for(int i = 0; i < 500000000; i+=5){
    residues[(i+0) % 100] = largeNumber % modules[(i+0) % 100];
    residues[(i+1) % 100] = largeNumber % modules[(i+1) % 100];
    residues[(i+2) % 100] = largeNumber % modules[(i+2) % 100];
    residues[(i+3) % 100] = largeNumber % modules[(i+3) % 100];
    residues[(i+4) % 100] = largeNumber % modules[(i+4) % 100];
}

with gcc -O2 the gain is ~15%. (500000000 instead of 100 to observe a more significant time difference)

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top