Multiword addition in C

Question 1

256-bit version

__uint128_t a[2], b[2], c[2];        // c = a + b
c[0] = a[0] + b[0];                  // add low part
c[1] = a[1] + b[1] + (c[0] < a[0]);  // add high part and carry
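
As a minimal sketch, the same two-step addition can be wrapped in a function. The `u256` type and the `u256_add` name are made up for illustration, and `__uint128_t` is assumed to be available (GCC or Clang on a 64-bit target).

```c
/* Hypothetical helper type: a 256-bit unsigned integer stored as two
   128-bit limbs, limb[0] being the low half (same order as a[], b[], c[]). */
typedef struct {
    __uint128_t limb[2];
} u256;

/* Returns a + b modulo 2^256. */
static u256 u256_add(u256 a, u256 b)
{
    u256 c;
    c.limb[0] = a.limb[0] + b.limb[0];        /* low half, wraps modulo 2^128 */
    unsigned carry = c.limb[0] < a.limb[0];   /* the sum wrapped iff it is below an operand */
    c.limb[1] = a.limb[1] + b.limb[1] + carry;
    return c;
}
```

The comparison works because unsigned addition wraps: the low sum ends up smaller than `a.limb[0]` exactly when a carry out of the low half occurred.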

Edit: 192-bit version. This way you can eliminate the 128-bit comparison, as @harold suggested:

struct uint192_t {
    __uint128_t H;
    uint64_t L;
} a, b, c;  // c = a + b
c.L = a.L + b.L;
c.H = a.H + b.H + (c.L < a.L);

Alternatively, you can use GCC's integer overflow builtins or Clang's checked arithmetic builtins:

bool carry = __builtin_add_overflow(a.L, b.L, &c.L);  // type-generic: works whether uint64_t is unsigned long or unsigned long long
c.H = a.H + b.H + carry;
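
A self-contained sketch of the builtin-based variant follows. The `uint192` typedef and the `uint192_add` name are illustrative only. It uses the type-generic `__builtin_add_overflow`, which both GCC and Clang provide, so the low limb does not have to be exactly `unsigned long`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 192-bit type mirroring the struct above. */
typedef struct {
    __uint128_t H;   /* upper 128 bits */
    uint64_t    L;   /* lower 64 bits  */
} uint192;

/* Returns a + b modulo 2^192. */
static uint192 uint192_add(uint192 a, uint192 b)
{
    uint192 c;
    /* Stores the wrapped 64-bit sum in c.L and returns true if it overflowed. */
    bool carry = __builtin_add_overflow(a.L, b.L, &c.L);
    c.H = a.H + b.H + carry;
    return c;
}
```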

Demo on Godbolt

If you do a lot of additions in a loop, you should consider using SIMD and/or running them in parallel with multithreading. For SIMD you may need to change the layout of the type so that you can add all the low parts at once and all the high parts at once. One possible solution is an array of structs of arrays, as suggested in "practical BigNum AVX/SSE possible?":

SSE2:   llhhllhhllhhllhh
AVX2:   llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh

With AVX-512 you can add eight 64-bit values at once. So you can add eight 192-bit values in 3 instructions plus a few more for the carry. For more information read "Is it possible to use SSE and SSE2 to make a 128-bit wide integer?"
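
To make the layout idea concrete, here is a plain C sketch of a batch of eight 192-bit integers stored limb by limb. The `u192x8` type, the `BATCH` constant and the function name are assumptions made for this example. It uses no intrinsics, so whether the inner loop actually turns into SIMD instructions depends on the compiler and its flags.

```c
#include <stdint.h>

enum { BATCH = 8 };  /* eight 64-bit lanes, i.e. one AVX-512 register per limb row */

/* Hypothetical structure-of-arrays batch: limb[k][i] is limb k
   (0 = least significant) of the i-th 192-bit integer. */
typedef struct {
    uint64_t limb[3][BATCH];
} u192x8;

/* c[i] = a[i] + b[i] modulo 2^192 for every lane i. */
static void u192x8_add(u192x8 *c, const u192x8 *a, const u192x8 *b)
{
    uint64_t carry[BATCH] = {0};
    for (int k = 0; k < 3; k++) {
        /* All lanes of one limb row are independent of each other. */
        for (int i = 0; i < BATCH; i++) {
            uint64_t t  = a->limb[k][i] + carry[i];
            uint64_t c1 = t < carry[i];        /* wrapped while adding the carry-in */
            uint64_t s  = t + b->limb[k][i];
            uint64_t c2 = s < t;               /* wrapped while adding b */
            c->limb[k][i] = s;
            carry[i] = c1 | c2;                /* at most one of the two can be set */
        }
    }
}
```

The carry chain runs across the three limb rows, while the eight lanes within a row never interact, which is what makes this layout friendly to vector units.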

With AVX2 or AVX-512 you may also have a very fast horizontal add, so it may also be worth a try for 256-bit integers even if you don't have parallel addition chains. For a 192-bit addition, however, 3 add/adc instructions would be much faster.

There are also many libraries with fixed-width integer types, for example Boost.Multiprecision:

#include <boost/multiprecision/cpp_int.hpp>

using namespace boost::multiprecision;

uint256_t myUnsignedInt256 = 1;

Some other libraries:

ttmath: ttmath::UInt<3> (an integer type with 3 limbs, which is 192 bits on 64-bit computers)
uint256_t

See also

C++ 128/256-bit fixed size integer types

Question 2

You could test whether the "add (low < oldlow) to simulate carry" technique from this answer is fast enough. It's slightly complicated by the fact that low is an __uint128_t here, which could hurt code generation. You might try it with four uint64_t's as well; I don't know whether that'll be better or worse.
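
A possible shape for the four-limb variant is sketched below. The function name and the little-endian limb order are assumptions for this example. With more than two limbs the carry has to be checked twice per limb, because adding the incoming carry can itself wrap.

```c
#include <stdint.h>

/* r = a + b modulo 2^256; limb 0 is the least significant. */
static void add256_limbs(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t  = a[i] + carry;
        uint64_t c1 = t < carry;   /* wrapped while adding the carry-in */
        uint64_t s  = t + b[i];
        uint64_t c2 = s < t;       /* wrapped while adding b[i] */
        r[i] = s;
        carry = c1 | c2;           /* at most one of the two can be set */
    }
}
```

Whether a compiler turns this into an add/adc chain varies, so it is worth inspecting the generated assembly before settling on it.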

If that's not good enough, drop to inline assembly and use the carry flag directly. It doesn't get any better than that, but you'd have the usual downsides of using inline assembly.
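
For illustration only, a 256-bit add/adc chain in GCC/Clang extended inline assembly could look like the sketch below. It assumes an x86-64 target and AT&T syntax. The function name is made up, and the `#else` branch is a portable fallback added so the example still builds on other targets.

```c
#include <stdint.h>

/* r = a + b modulo 2^256; limb 0 is the least significant. */
static void add256_adc(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t r0 = a[0], r1 = a[1], r2 = a[2], r3 = a[3];
    __asm__("addq %4, %0\n\t"   /* lowest limb: sets the carry flag */
            "adcq %5, %1\n\t"   /* each adc consumes the flag and sets it again */
            "adcq %6, %2\n\t"
            "adcq %7, %3"
            /* "+&r": read-write and early-clobber, so no input shares a register
               with an output that is written before that input is read */
            : "+&r"(r0), "+&r"(r1), "+&r"(r2), "+&r"(r3)
            : "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3])
            : "cc");
    r[0] = r0; r[1] = r1; r[2] = r2; r[3] = r3;
#else
    /* Portable fallback: simulate the carry with comparisons. */
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = a[i] + carry;
        uint64_t s = t + b[i];
        carry = (t < carry) | (s < t);
        r[i] = s;
    }
#endif
}
```

The early-clobber modifiers and the `"cc"` clobber are the kind of detail that makes inline assembly easy to get subtly wrong, which is one of the downsides mentioned above.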