overflow possibilities in modular exponentiation by squaring

Question 1

If the modulus is close to the limit of the largest integer type you can use, things get somewhat complicated. If you can't use a library that implements biginteger arithmetic, you can roll a modular multiplication yourself by splitting the factors in low-order and high-order parts.

If the modulus m is so large that 2*(m-1) overflows, things get really fussy, but if 2*(m-1) doesn't overflow, it's bearable.

Let us suppose you have and use a 64-bit unsigned integer type.

You can calculate the modular product by splitting the factors into low and high 32 bits, the product then splits into

a = a1 + (a2 << 32)    // 0 <= a1, a2 < (1 << 32)
b = b1 + (b2 << 32)    // 0 <= b1, b2 < (1 << 32)
a*b = a1*b1 + (a1*b2 << 32) + (a2*b1 << 32) + (a2*b2 << 64)

To calculate a*b (mod m) with m <= (1 << 63), reduce each of the four products modulo m,

p1 = (a1*b1) % m;
p2 = (a1*b2) % m;
p3 = (a2*b1) % m;
p4 = (a2*b2) % m;

and the simplest way to incorporate the shifts is

for(i = 0; i < 32; ++i) {
    p2 *= 2;
    if (p2 >= m) p2 -= m;
}

the same for p3 and with 64 iterations for p4. Then

s = p1+p2;
if (s >= m) s -= m;
s += p3;
if (s >= m) s -= m;
s += p4;
if (s >= m) s -= m;
return s;

That way is not very fast, but for the few multiplications needed here, it may be fast enough. A small speedup should be obtained by reducing the number of shifts; first calculate (p4 << 32) % m,

for(i = 0; i < 32; ++i) {
    p4 *= 2;
    if (p4 >= m) p4 -= m;
}

then all of p2, p3 and the current value of p4 need to be multiplied with 2³² modulo m,

p4 += p3;
if (p4 >= m) p4 -= m;
p4 += p2;
if (p4 >= m) p4 -= m;
for(i = 0; i < 32; ++i) {
    p4 *= 2;
    if (p4 >= m) p4 -= m;
}
s = p4+p1;
if (s >= m) s -= m;
return s;

Question 2

You can perform your multiplications in several stages. For example, say you want to compute X*Y mod n. Take X and Y and write them as X = 10^9*X_1 + X_0, Y = 10^9*Y_1 + Y_0. Then compute all four products X_i*Y_j mod n, and finally compute X = 10^18*(X_1*Y_1 mod n) + 10^9*( X_0*Y_1 + X_1*Y_0 mod n) + X_0*Y_0. Note that in this case, you are operating with numbers half the size of the maximum allowed.

If splitting in two parts do not suffice (I suspect this is the case), split in three parts using the same schema. Splitting in three should work.

A simpler approach is just to multiply the school way. It corresponds to the previous approach, but writing one number in as many parts as digits it has.

Good luck!