Question

I am writing a code library in x86-64 assembly-language to provide all conventional bitwise, shift, logical, compare, arithmetic and math functions for s0128, s0256, s0512, s1024 signed-integer types and f0128, f0256, f0512, f1024 floating-point types. So far I'm working on the signed-integer types, because the floating-point functions will likely call some internal routines written for the integer types.

So far I've written and tested functions to perform the various unary operators, compare operators, and the add, subtract and multiply operators.

Now I'm trying to decide how to implement functions for the divide operators.

My first thought was, "Newton-Raphson must be the best approach". Why? Because it converges very quickly given a good seed (starting guess), and I figure I should be able to figure out how to execute the native 64-bit divide instruction on the operands to get an excellent seed value. In fact, if the seed value is precise to 64-bits, to get the correct answers should only take:

s0128 : 1~2 iterations : (or 1 iteration  plus 1~2 "test subtracts")
`s0256` : 2~3 iterations : (or 2 iterations plus 1~2 "test subtracts")
`s0512` : 3~4 iterations : (or 3 iterations plus 1~2 "test subtracts")
`s1024` : 4~5 iterations : (or 4 iterations plus 1~2 "test subtracts")

However, a bit more thinking about this question makes me wonder. For example, consider the core routine I wrote that performs the multiply operation for all the large integer types:

s0128 :   4 iterations ==   4 (128-bit = 64-bit * 64-bit) multiplies +  12 adds
s0256 :  16 iterations ==  16 (128-bit = 64-bit * 64-bit) multiplies +  48 adds
s0512 :  64 iterations ==  64 (128-bit = 64-bit * 64-bit) multiplies + 192 adds
s1024 : 256 iterations == 256 (128-bit = 64-bit * 64-bit) multiplies + 768 adds

The growth in operations for the wider data-types is quite substantial, even though the loop is fairly short and efficient (including cache-wise). This loop writes each 64-bit portion of the result only once, and never reads back any portion of the result for further processing.
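
For illustration, here is a rough C sketch of a loop with that structure (a column-wise "Comba"-style multiply). It is not my actual assembly routine, just a sketch of the n*n pattern of 64-bit x 64-bit partial products described above, assuming little-endian limb order:

    #include <stdint.h>
    #include <stddef.h>

    typedef unsigned __int128 u128;

    /* Column-wise ("Comba") schoolbook multiply: na x nb limbs -> na+nb limbs.
     * Each result limb is written exactly once and never read back, and the
     * work is n*n 64x64->128 multiplies for n-limb operands (4, 16, 64, 256
     * for the sizes listed above). Little-endian limb order. */
    static void mul_comba(uint64_t *r, const uint64_t *a, size_t na,
                          const uint64_t *b, size_t nb)
    {
        uint64_t c0 = 0, c1 = 0, c2 = 0;          /* 192-bit column accumulator */
        for (size_t k = 0; k < na + nb; k++) {
            size_t lo = (k >= nb) ? k - nb + 1 : 0;
            size_t hi = (k < na) ? k : na - 1;
            for (size_t i = lo; i <= hi; i++) {
                u128 p = (u128)a[i] * b[k - i];   /* 64x64 -> 128 partial product */
                uint64_t plo = (uint64_t)p;
                uint64_t phi = (uint64_t)(p >> 64);
                c0 += plo;
                uint64_t cy = (c0 < plo);         /* carry out of the low limb */
                c1 += cy;  c2 += (c1 < cy);
                c1 += phi; c2 += (c1 < phi);
            }
            r[k] = c0;                            /* this result limb is now final */
            c0 = c1; c1 = c2; c2 = 0;             /* slide the accumulator one limb */
        }
    }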

This got me thinking about whether more conventional shift-and-subtract type divide algorithms might be faster, especially for the larger types.

My first thought was this:

result = dividend / divisor                  // if I remember my terminology
remainder = dividend - (result * divisor)    // or something along these lines

#1: To compute each bit, generally the divisor is subtracted from the dividend IF the divisor is less than or equal to the dividend. Well, usually we can determine the divisor is definitely less-than or definitely greater-than the dividend by only inspecting their most-significant 64-bit portions. Only when those ms64-bit portions are equal must the routine check the next lower 64-bit portions, and only when they are equal must we check even lower, and so forth. Therefore, on almost every iteration (computing each bit of result), we can greatly reduce the instructions executed to compute this test.

#2: However... on average, about 50% of the time we will find we need to subtract the divisor from the dividend, so we will need to subtract their entire widths anyway. In this case we actually executed more instructions than we would have in the conventional approach (where we first subtract them, then test the flags to determine whether the divisor <= dividend). Therefore, half the time we realize a saving, and half the time we realize a loss. On large types like s1024 (which contains -16- 64-bit components), savings are substantial and losses are small, so this approach should realize a large net savings. On tiny types like s0128 (which contains -2- 64-bit components), savings are tiny and losses significant but not huge.
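
A rough C sketch of the early-out comparison described in #1 (the real implementation would be unrolled assembly):

    #include <stdint.h>
    #include <stddef.h>

    /* Compare two n-limb unsigned values (little-endian limb order).
     * Returns -1, 0, or +1.  On typical operands the very first (most
     * significant) limb decides, so the full-width work is only needed
     * when the leading limbs happen to be equal. */
    static int cmp_limbs(const uint64_t *a, const uint64_t *b, size_t n)
    {
        for (size_t i = n; i-- > 0; ) {
            if (a[i] != b[i])
                return (a[i] > b[i]) ? 1 : -1;
        }
        return 0;
    }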


So, my question is, "what are the most efficient divide algorithms", given:

#1: modern x86-64 CPUs like FX-8350
#2: executing in 64-bit mode only (no 32-bit)
#3: implementation entirely in assembly-language
#4: 128-bit to 1024-bit integer operands (nominally signed, but...)

NOTE: My guess is, the actual implementation will operate only on unsigned integers. In the case of multiply, it turned out to be easier and more efficient (maybe) to convert negative operands to positive, then perform unsigned-multiply, then negate the result if exactly one original operand was negative. However, I'll consider a signed-integer algorithm if it is efficient.
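
A sketch of that wrapper pattern, shown on plain 64-bit values and assuming truncated (round-toward-zero) division like C's; the wide version would negate limb-wise with borrow:

    #include <stdint.h>

    /* Signed divide built on an unsigned divide, the same wrapper pattern as
     * for multiply: strip the signs, divide unsigned, then fix up the signs.
     * Truncated division: the quotient rounds toward zero and the remainder
     * takes the sign of the dividend.  (Edge cases such as INT64_MIN need
     * extra care in real code.) */
    static void sdiv_via_udiv(int64_t a, int64_t b, int64_t *q, int64_t *r)
    {
        uint64_t ua = (a < 0) ? 0 - (uint64_t)a : (uint64_t)a;
        uint64_t ub = (b < 0) ? 0 - (uint64_t)b : (uint64_t)b;
        uint64_t uq = ua / ub;
        uint64_t ur = ua % ub;
        *q = ((a < 0) != (b < 0)) ? -(int64_t)uq : (int64_t)uq;
        *r = (a < 0) ? -(int64_t)ur : (int64_t)ur;
    }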

NOTE: If the best answers are different for my floating-point types (f0128, f0256, f0512, f1024), please explain why.

NOTE: My internal core unsigned-multiply routine, which performs the multiply operation for all these integer data-types, produces a double-width result. In other words:

u0256 = u0128 * u0128     // cannot overflow
u0512 = u0256 * u0256     // cannot overflow
u1024 = u0512 * u0512     // cannot overflow
u2048 = u1024 * u1024     // cannot overflow

My code library offers two versions of multiply for each signed-integer data-type:

s0128 = s0128 * s0128     // can overflow (result not fit in s0128)
s0256 = s0256 * s0256     // can overflow (result not fit in s0256)
s0512 = s0512 * s0512     // can overflow (result not fit in s0512)
s1024 = s1024 * s1024     // can overflow (result not fit in s1024)

s0256 = s0128 * s0128     // cannot overflow
s0512 = s0256 * s0256     // cannot overflow
s1024 = s0512 * s0512     // cannot overflow
s2048 = s1024 * s1024     // cannot overflow

This is consistent with the policy of my code library to "never lose precision" and "never overflow" (errors are returned when the answer is invalid due to precision-loss or due to overflow/underflow). However, when double-width return value functions are called, no such errors can occur.


The solution

Surely you know about the existing arbitrary-precision packages (e.g., http://gmplib.org/) and how they operate? They are generally designed to run "as fast as possible" for arbitrary precisions.

If you specialized them for fixed sizes (e.g., manually applied partial-evaluation techniques to fold constants and unroll loops), I'd expect you to get pretty good routines for the specific fixed-size precisions you want.

Also, if you haven't seen it, check out D. Knuth's Seminumerical Algorithms, an oldie but really a goodie, which provides the key algorithms for multi-precision arithmetic. (Most of the packages are based on these ideas, but Knuth has great explanations and gets an awful lot right.)

The key idea is to treat multi-precision numbers as if they were very-big-radix numbers (e.g., radix 2^64) and apply standard 3rd-grade arithmetic to the "digits" (here, 64-bit words). Division consists of "estimate a quotient (big-radix) digit, multiply the estimate by the divisor, subtract from the dividend, shift left one digit, repeat" until you have enough digits to satisfy you. For division, yes, it's all unsigned (with sign handling done in wrappers). The basic tricks are estimating a quotient digit well (using the single-precision instructions the processor gives you) and doing fast multi-precision multiplies by single digits. See Knuth for details. See technical research papers on multi-precision arithmetic (you get to do some research) for exotic ("fastest possible") improvements.
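
As a minimal sketch of the big-radix idea (the easy case of a single-digit divisor, not the full multi-digit Algorithm D with its quotient-digit estimation and correction steps):

    #include <stdint.h>
    #include <stddef.h>

    typedef unsigned __int128 u128;

    /* Short division: n-limb dividend (little-endian limbs) by one 64-bit
     * "digit" d.  Each step divides a two-limb partial remainder by d, which
     * is exactly schoolbook long division in radix 2^64.  Returns the
     * remainder.  The 128/64 division here can be a single hardware DIV when
     * written in assembly, because the running remainder is always < d. */
    static uint64_t div_limbs_by_1(uint64_t *q, const uint64_t *x, size_t n,
                                   uint64_t d)
    {
        uint64_t r = 0;
        for (size_t i = n; i-- > 0; ) {
            u128 cur = ((u128)r << 64) | x[i];
            q[i] = (uint64_t)(cur / d);
            r    = (uint64_t)(cur % d);
        }
        return r;
    }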

Other tips

The "large-radix" approaches are more efficient for the kinds of huge data-types you mention, especially if you can execute 128-bit divided by 64-bit instructions in assembly-language.

While Newton-Raphson iteration does converge quickly, each iteration requires a huge number of multiply and accumulate steps.

For the multiplication, have a look here:

http://www.math.niu.edu/~rusin/known-math/99/karatsuba
(archived copy: http://web.archive.org/web/20141114071302/http://www.math.niu.edu/~rusin/known-math/99/karatsuba)

Basically, it allows doing a 1024 x 1024 multiplication using three (instead of four) 512 x 512 bit multiplications. Or nine 256 x 256 bit, or twenty-seven 128 x 128 bit multiplications. The added complexity might not beat brute force even at 1024 x 1024 bits, but probably would for bigger products. That's the simplest of the "fast" algorithms, using n ^ (log 3 / log 2) = n^1.585 multiplications.
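
As a self-contained sketch of the identity, here is a 128 x 128 -> 256 bit multiply in C built from three 64 x 64 -> 128 multiplies (the subtractive variant, which keeps the middle term from overflowing). At this size it will not beat four plain multiplies; it only shows the mechanics:

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Karatsuba 128x128 -> 256: x = x1*2^64 + x0, y = y1*2^64 + y0.
     * Three 64x64 multiplies: z2 = x1*y1, z0 = x0*y0, m = |x1-x0|*|y1-y0|.
     * The middle term x1*y0 + x0*y1 equals z2 + z0 - (x1-x0)*(y1-y0).
     * Result is a little-endian 4-limb array. */
    static void mul128_karatsuba(uint64_t x1, uint64_t x0,
                                 uint64_t y1, uint64_t y0, uint64_t out[4])
    {
        u128 z2 = (u128)x1 * y1;                    /* high partial product */
        u128 z0 = (u128)x0 * y0;                    /* low partial product  */

        uint64_t dx = (x1 >= x0) ? x1 - x0 : x0 - x1;
        uint64_t dy = (y1 >= y0) ? y1 - y0 : y0 - y1;
        int neg = (x1 >= x0) != (y1 >= y0);         /* sign of (x1-x0)*(y1-y0) */
        u128 m = (u128)dx * dy;

        /* The middle term needs up to 129 bits, so carry it explicitly. */
        u128 z1 = z2 + z0;
        uint64_t c = (z1 < z2);
        if (neg) { z1 += m; c += (z1 < m); }
        else     { c -= (z1 < m); z1 -= m; }

        /* Assemble out = z0 + z1*2^64 + z2*2^128 (c lands at bit 192). */
        out[0] = (uint64_t)z0;
        u128 t = (z0 >> 64) + (uint64_t)z1;
        out[1] = (uint64_t)t;
        t = (t >> 64) + (z1 >> 64) + (uint64_t)z2;
        out[2] = (uint64_t)t;
        t = (t >> 64) + (z2 >> 64) + c;
        out[3] = (uint64_t)t;
    }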

I'd advise not writing everything in assembler. Implement 64 x 64 -> 128 bit multiplication with inline assembler, and the same for add-with-carry (I think gcc and clang have built-in operations for this nowadays); then you can, for example, multiply n bits x 256 bits (any number of words times 4 words) in parallel, hiding the latency of the multiplications, without going mad with assembler.
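
For example (gcc/clang, using the unsigned __int128 extension; clang, for instance, also provides __builtin_addcll for the carry chain):

    #include <stdint.h>
    #include <stddef.h>

    typedef unsigned __int128 u128;

    /* 64x64 -> 128 multiply; compilers lower this to a single MUL/MULX. */
    static inline void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
    {
        u128 p = (u128)a * b;
        *hi = (uint64_t)(p >> 64);
        *lo = (uint64_t)p;
    }

    /* Add-with-carry on one limb; carry is 0 or 1 on entry and exit. */
    static inline uint64_t addc64(uint64_t a, uint64_t b, uint64_t *carry)
    {
        u128 s = (u128)a + b + *carry;
        *carry = (uint64_t)(s >> 64);
        return (uint64_t)s;
    }

    /* acc[0..n-1] += x[0..n-1] * d: the "n words times one word" building
     * block used to assemble wider products.  Returns the carry limb that
     * belongs at position n. */
    static uint64_t addmul_1(uint64_t *acc, const uint64_t *x, size_t n, uint64_t d)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            u128 p = (u128)x[i] * d + acc[i] + carry;   /* cannot overflow 128 bits */
            acc[i] = (uint64_t)p;
            carry  = (uint64_t)(p >> 64);
        }
        return carry;
    }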

For a large number of bits, I learned that the quickest algorithm goes like this: instead of dividing x / y, you calculate 1 / y and multiply by x. To calculate 1 / y:

1/y is the root t of f(t) = 1/(t*y) - 1 = 0.
Newton iteration: t' = t - f(t) / f'(t)
                     = t - (1/(t*y) - 1) / (-1/(t^2 * y))
                     = t + (t - t^2 * y)
                     = 2t - t^2 * y

The Newton iteration converges quadratically. Now the trick: if you want 1024-bit precision, you start with 32 bits; one iteration step gives 64 bits, the next gives 128 bits, then 256, then 512, then 1024. So you do many iterations, but only the last one runs at full precision. All in all, you do one 512 x 512 -> 1024 product (t^2), one 1024 x 1024 -> 1024 product (t^2 * y, completing the new estimate of 1 / y), and one more 1024 x 1024 product (x * (1 / y)).
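
To see the quadratic convergence (and why only the final step needs full precision), here is the same iteration shown with plain doubles for clarity; the big-integer version replaces the multiplies with truncated multiprecision products:

    #include <stdio.h>

    /* Newton iteration t' = t * (2 - t*y) = 2t - t^2*y for 1/y.
     * Each step roughly doubles the number of correct bits, so a 1024-bit
     * reciprocal only pays full-width multiply cost on the last step. */
    int main(void)
    {
        double y = 3.5;
        double t = 0.25;                       /* crude seed: a few correct bits */
        for (int i = 1; i <= 5; i++) {
            t = t * (2.0 - t * y);             /* one Newton step */
            printf("step %d: t = %.17g  error = %.3g\n", i, t, t - 1.0 / y);
        }
        return 0;
    }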

Of course you have to figure out very precisely what the error is after each iteration; you'll probably have to start with, say, 40 bits and lose a bit of precision in each step, so that you have enough at the end.

I have no idea at which point this would run faster than a straightforward brute-force division as you learned it at school. And y may have fewer than the full number of bits.

The alternative is brute force. You could take the highest 128 bits of x, divide them by the highest 64 bits of y, get the highest 64 bits r of the quotient, and then subtract r * y from x. Repeat as needed, carefully checking how big the errors are.

Divisions are slooow. So you calculate 2^127 / (highest 64 bits of y) once. Then, to estimate the next 64 bits, you multiply the highest 64 bits of x by this number and shift everything into the right place. Multiplication is much faster than division.

Next you'll find that all these operations have long latencies. For example, it may take 5 cycles to get a result, but you can start a multiplication every cycle. So: estimate 64 bits of the result. Start subtracting r * y at the high end of x, so you can estimate the next 64 bits as quickly as possible. Then subtract two or more products from x simultaneously, to hide the latency. Implementing this is tough. Some of it may not be worth it even for 1024 bits (which is just sixteen 64-bit integers).

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow