Speeding up a loop that performs unsigned long long modulo operations

https://stackoverflow.com//questions/22064566

23-12-2019

Question

I need to perform many operations that find the remainder of an unsigned long long number divided by a 16-bit modulus.

unsigned long long largeNumber;
long residues[100];
unsigned long modules[100];
intiModules(modules); //set different 16-bit values

for(int i = 0; i < 100; i++){
     residues[i] = largeNumber % modules[i];
}

How can I speed up this loop?

The iteration count is not large (32-128), but the loop is executed very often, so its speed is critical.

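Solution

Division by a constant (and there are only 65536 of them) can be performed by multiplying by the reciprocal, with a small adjustment before or after the multiplication. Because this method is exact only for a limited range, one can first reduce the 64-bit operand to a much smaller value that is still congruent to the original value: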

// pseudo code -- not c
a = 0x1234567890abcdefULL;
a = (0x1234 << 48) + (0x5678 << 32) + (0x90ab << 16) + 0xcdef;

a % N === ((0x1234 * (2^48 % N)) +    // === means 'is congruent'
           (0x5678 * (2^32 % N)) +    // ^ means exponentiation
           (0x90ab * (2^16 % N)) +
           (0xcdef * 1)) % N;

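The intermediate value can be computed with (small) multiplications only, and the final remainder (% N) can potentially be computed with reciprocal multiplication.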
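The limb-splitting step above can be sketched in C roughly as follows. This is only an illustration under assumptions: the function name `mod16_limbs` is hypothetical, the powers of 2^16 are recomputed on every call for brevity (in the loop from the question they could be precomputed once per modulus), and the final reduction still uses a plain `%` rather than a reciprocal multiplication.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original answer): computes a % n for a
   16-bit modulus n by splitting a into four 16-bit limbs. */
static uint32_t mod16_limbs(uint64_t a, uint32_t n)
{
    assert(n >= 1 && n <= 0xFFFFu);

    /* 2^16, 2^32 and 2^48 reduced modulo n. Each factor is below 2^16, so
       the 32-bit products cannot overflow. These three values depend only
       on n and could be cached per modulus. */
    uint32_t p16 = 65536u % n;
    uint32_t p32 = (p16 * p16) % n;
    uint32_t p48 = (p32 * p16) % n;

    /* The four 16-bit limbs of a, least significant first. */
    uint32_t limb0 = (uint32_t)(a & 0xFFFFu);
    uint32_t limb1 = (uint32_t)((a >> 16) & 0xFFFFu);
    uint32_t limb2 = (uint32_t)((a >> 32) & 0xFFFFu);
    uint32_t limb3 = (uint32_t)(a >> 48);

    /* The sum is congruent to a modulo n and stays below 2^34. */
    uint64_t sum = (uint64_t)limb3 * p48
                 + (uint64_t)limb2 * p32
                 + (uint64_t)limb1 * p16
                 + limb0;

    /* Final reduction of the small value; this is the step the answer
       suggests could be replaced by a reciprocal multiplication. */
    return (uint32_t)(sum % n);
}
```

With the three powers cached per modulus, each residue costs three small multiplications, three additions and one reduction of a value below 2^34. Whether that beats the hardware 64-bit division depends on the target CPU and compiler, so it is worth measuring.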

Other tips

If speed is critical, loop unrolling may help, according to this answer about branch prediction and this one. Unrolling reduces the number of loop-condition tests induced by the for instruction and can improve branch prediction.

The gain varies with the architecture and compiler, and may be zero, since some compilers perform this optimization for you.

On my machine, changing the loop, while preserving the number of operations, from

for(int i = 0; i < 500000000; i++){
    residues[i % 100] = largeNumber % modules[i % 100];
}

to

for(int i = 0; i < 500000000; i+=5){
    residues[(i+0) % 100] = largeNumber % modules[(i+0) % 100];
    residues[(i+1) % 100] = largeNumber % modules[(i+1) % 100];
    residues[(i+2) % 100] = largeNumber % modules[(i+2) % 100];
    residues[(i+3) % 100] = largeNumber % modules[(i+3) % 100];
    residues[(i+4) % 100] = largeNumber % modules[(i+4) % 100];
}

gives a gain of ~15% with gcc -O2. (The loop runs 500000000 iterations instead of 100 so that the time difference is easier to observe.)
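Applied to the array from the question rather than to the benchmark loop, the same unrolling idea might look like the sketch below. The function name `residues_unrolled4` and the unroll factor of 4 are assumptions chosen for illustration; a trailing loop handles counts that are not a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: stores value % moduli[i] into residues[i] for
   i in [0, count), processing four elements per loop iteration. */
static void residues_unrolled4(unsigned long long value,
                               const unsigned long *moduli,
                               long *residues, size_t count)
{
    assert(moduli != NULL && residues != NULL);

    size_t i = 0;

    /* Main unrolled body: one loop test per four modulo operations. */
    for (; i + 4 <= count; i += 4) {
        residues[i + 0] = (long)(value % moduli[i + 0]);
        residues[i + 1] = (long)(value % moduli[i + 1]);
        residues[i + 2] = (long)(value % moduli[i + 2]);
        residues[i + 3] = (long)(value % moduli[i + 3]);
    }

    /* Remainder loop for the last count % 4 elements. */
    for (; i < count; i++) {
        residues[i] = (long)(value % moduli[i]);
    }
}
```

As noted above, a compiler may already unroll the simple loop on its own, so any gain from writing it out by hand should be confirmed by timing on the target machine.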

License: CC-BY-SA with attribution

Not affiliated with StackOverflow