Multiplying two 128-bit ints

Question 1

Summarizing your question: how can you add two arrays of (unsigned) integers while propagating the carry?

uint16_t foo[4];  // 0000 aaaa FFFF cccc
uint16_t bar[4];  // dddd eeee FFFF 0000

The good news is that the worst case, 'FFFF+FFFF+1', is simply (1)FFFF. The incoming carry can therefore always be added into each word without producing a second carry; that would only be a concern if the sum could reach 20000, which it cannot.

Form a temporary sum, sum = foo[3] + bar[3] + carry, with carry initially 0. This sum either produces a new carry or it does not.

With sums truncated to the word width (unsigned wraparound):
A carry is produced from (A+B) if (A+B) < A.
When summing (A+B+c), a carry is produced if ((A + c) < A) || (((A + c) + B) < B).
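The carry test above can be sketched as a loop over the words. This is a minimal sketch, not a definitive implementation: the function name add_words and its signature are made up for illustration, and it assumes index 0 holds the most significant word, as in the foo/bar layout above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst = a + b for n-word numbers, where index 0 is the
 * most significant word. Returns the carry out of the top word (0 or 1). */
unsigned add_words(uint16_t *dst, const uint16_t *a,
                   const uint16_t *b, size_t n)
{
    unsigned carry = 0;
    for (size_t i = n; i-- > 0; ) {
        /* uint16_t operands are promoted to int, so each sum is cast back
         * to 16 bits before the wraparound comparison. */
        uint16_t t = (uint16_t)(a[i] + carry);   /* A + c       */
        uint16_t s = (uint16_t)(t + b[i]);       /* (A + c) + B */
        carry = (t < a[i]) || (s < b[i]);
        dst[i] = s;
    }
    return carry;
}
```

The explicit casts are the one C-specific pitfall here: without them, the comparison would be made on the untruncated int result and would never detect a carry.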

Another possibility is to calculate a "multi-bit carry" by summing several terms column by column, which comes up often in bignum multiplication:

            AAAA BBBB CCCC
       DDDD EEEE FFFF ....
  GGGG HHHH IIII .... ....
--------------------------
  col4 col3 col2 col1 col0

Now each column produces a 32-bit or 64-bit result and a carry that doesn't necessarily fit in a single bit.

uint32_t sum_low = carry_in_from_previous_column;
uint32_t sum_high = 0;

// accumulate the low and high 16-bit halves of each term separately
for (int i = 0; i < N; i++) {
     sum_low += matrix[i][column] & 0xffff;
     sum_high += matrix[i][column] >> 16;
}
sum_high += sum_low >> 16;    // add the multibit half carry

result = (sum_low & 0xffff) | (sum_high << 16);   // low 32 bits of the column
carry_out = sum_high >> 16;                       // goes into the next column
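For reference, the same column sum can be wrapped in a self-contained function. This is only a sketch: the name sum_column and its parameters are hypothetical, and it assumes at most 65535 terms and a carry_in below 65536, so that neither 32-bit accumulator can overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n 32-bit terms of one column plus carry_in.
 * Stores the low 32 bits in *result and returns the multi-bit carry.
 * Assumes n <= 65535 and carry_in < 65536 (no accumulator overflow). */
uint32_t sum_column(const uint32_t *terms, size_t n,
                    uint32_t carry_in, uint32_t *result)
{
    uint32_t sum_low = carry_in;   /* sum of the low 16-bit halves  */
    uint32_t sum_high = 0;         /* sum of the high 16-bit halves */

    for (size_t i = 0; i < n; i++) {
        sum_low  += terms[i] & 0xffff;
        sum_high += terms[i] >> 16;
    }
    sum_high += sum_low >> 16;     /* fold in the multi-bit half carry */

    *result = (sum_low & 0xffff) | (sum_high << 16);
    return sum_high >> 16;         /* carry into the next column */
}
```

For example, three terms of 0xFFFFFFFF with carry_in = 2 sum to 3 * 2^32 - 1, so the function stores 0xFFFFFFFF and returns a carry of 2, which would not fit in a single carry bit.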

Question 2

If you're on GCC or Clang, you can use __int128 and unsigned __int128 directly, provided the target supports them (typically 64-bit targets).
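As a minimal sketch of what that looks like, the following computes a full 64x64 -> 128-bit product and splits it into two 64-bit halves. The helper name mul_64x64 is made up for illustration, and the code assumes a compiler and target where the unsigned __int128 extension is available.

```c
#include <stdint.h>

/* Hypothetical helper: full 128-bit product of two 64-bit values.
 * Relies on the GCC/Clang unsigned __int128 extension (not standard C). */
void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    /* Widen one operand first so the multiplication is done in 128 bits. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;           /* low 64 bits  */
    *hi = (uint64_t)(p >> 64);   /* high 64 bits */
}
```

The cast on the first operand is what matters: without it, a * b would be evaluated in 64 bits and the upper half of the product would be lost before the assignment.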

Question 3

You are stuck in an infinite loop because 1/32 is an integer division that yields 0, so i += 1/32 is the same as i += 0.

Also note that memcpy(&d[3l/2-i], dstM1, 1/8); is the same as memcpy(&d[1-i], dstM1, 0);, which copies nothing.
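The constant expressions involved can be checked in isolation. In this small sketch the function names are hypothetical and exist only to label each expression; the floating-point variant is included purely for contrast, not as a claim about what the original code intended.

```c
#include <stddef.h>

/* Integer division truncates toward zero when both operands are integers. */
int    step_as_written(void) { return 1 / 32; }    /* int / int     -> 0       */
double step_as_real(void)    { return 1.0 / 32; }  /* double / int  -> 0.03125 */
long   index_base(void)      { return 3L / 2; }    /* long / int    -> 1       */
size_t copy_length(void)     { return 1 / 8; }     /* int / int     -> 0 bytes */
```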