Question

I am writing a code library in x86-64 assembly-language to provide all conventional bitwise, shift, logical, compare, arithmetic and math functions for s0128, s0256, s0512, s1024 signed-integer types and f0128, f0256, f0512, f1024 floating-point types. So far I'm working on the signed-integer types, because the floating-point functions will likely call some internal routines written for the integer types.

So far I've written and tested functions to perform the various unary operators, compare operators, and the add, subtract and multiply operators.

Now I'm trying to decide how to implement functions for the divide operators.

My first thought was, "Newton-Raphson must be the best approach". Why? Because it converges very quickly given a good seed (starting guess), and I figure I should be able to figure out how to execute the native 64-bit divide instruction on the operands to get an excellent seed value. In fact, if the seed value is precise to 64-bits, to get the correct answers should only take:

s0128 : 1~2 iterations : (or 1 iteration  plus 1~2 "test subtracts")
`s0256` : 2~3 iterations : (or 2 iterations plus 1~2 "test subtracts")
`s0512` : 3~4 iterations : (or 3 iterations plus 1~2 "test subtracts")
`s1024` : 4~5 iterations : (or 4 iterations plus 1~2 "test subtracts")

However, a bit more thinking about this question makes me wonder. For example, consider the core routine I wrote that performs the multiply operation for all the large integer types:

s0128 :   4 iterations ==   4 (128-bit = 64-bit * 64-bit) multiplies +  12 adds
s0256 :  16 iterations ==  16 (128-bit = 64-bit * 64-bit) multiplies +  48 adds
s0512 :  64 iterations ==  64 (128-bit = 64-bit * 64-bit) multiplies + 192 adds
s1024 : 256 iterations == 256 (128-bit = 64-bit * 64-bit) multiplies + 768 adds

The growth in operations for the wider data-types is quite substantial, even though the loop is fairly short and efficient (including cache-wise). This loop writes each 64-bit portion of the result only once, and never reads back any portion of the result for further processing.
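
For illustration, here is a rough C sketch of a loop with that structure (a column-wise "Comba"-style multiply). It is not my actual assembly routine, just a sketch of the n*n pattern of 64-bit x 64-bit partial products described above, assuming little-endian limb order:

    #include <stdint.h>
    #include <stddef.h>

    typedef unsigned __int128 u128;

    /* Column-wise ("Comba") schoolbook multiply: na x nb limbs -> na+nb limbs.
     * Each result limb is written exactly once and never read back, and the
     * work is n*n 64x64->128 multiplies for n-limb operands (4, 16, 64, 256
     * for the sizes listed above). Little-endian limb order. */
    static void mul_comba(uint64_t *r, const uint64_t *a, size_t na,
                          const uint64_t *b, size_t nb)
    {
        uint64_t c0 = 0, c1 = 0, c2 = 0;          /* 192-bit column accumulator */
        for (size_t k = 0; k < na + nb; k++) {
            size_t lo = (k >= nb) ? k - nb + 1 : 0;
            size_t hi = (k < na) ? k : na - 1;
            for (size_t i = lo; i <= hi; i++) {
                u128 p = (u128)a[i] * b[k - i];   /* 64x64 -> 128 partial product */
                uint64_t plo = (uint64_t)p;
                uint64_t phi = (uint64_t)(p >> 64);
                c0 += plo;
                uint64_t cy = (c0 < plo);         /* carry out of the low limb */
                c1 += cy;  c2 += (c1 < cy);
                c1 += phi; c2 += (c1 < phi);
            }
            r[k] = c0;                            /* this result limb is now final */
            c0 = c1; c1 = c2; c2 = 0;             /* slide the accumulator one limb */
        }
    }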

This got me thinking about whether more conventional shift-and-subtract type divide algorithms might be faster, especially for the larger types.

My first thought was this:

result = dividend / divisor                  // if I remember my terminology
remainder = dividend - (result * divisor)    // or something along these lines

#1: To compute each bit, generally the divisor is subtracted from the dividend IF the divisor is less than or equal to the dividend. Well, usually we can determine the divisor is definitely less-than or definitely greater-than the dividend by only inspecting their most-significant 64-bit portions. Only when those ms64-bit portions are equal must the routine check the next lower 64-bit portions, and only when they are equal must we check even lower, and so forth. Therefore, on almost every iteration (computing each bit of result), we can greatly reduce the instructions executed to compute this test.

#2: However... on average, about 50% of the time we will find we need to subtract the divisor from the dividend, so we will need to subtract their entire widths anyway. In this case we actually executed more instructions than we would have in the conventional approach (where we first subtract them, then test the flags to determine whether the divisor <= dividend). Therefore, half the time we realize a saving, and half the time we realize a loss. On large types like s1024 (which contains -16- 64-bit components), savings are substantial and losses are small, so this approach should realize a large net savings. On tiny types like s0128 (which contains -2- 64-bit components), savings are tiny and losses significant but not huge.
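
A rough C sketch of the early-out comparison described in #1 (the real implementation would be unrolled assembly):

    #include <stdint.h>
    #include <stddef.h>

    /* Compare two n-limb unsigned values (little-endian limb order).
     * Returns -1, 0, or +1.  On typical operands the very first (most
     * significant) limb decides, so the full-width work is only needed
     * when the leading limbs happen to be equal. */
    static int cmp_limbs(const uint64_t *a, const uint64_t *b, size_t n)
    {
        for (size_t i = n; i-- > 0; ) {
            if (a[i] != b[i])
                return (a[i] > b[i]) ? 1 : -1;
        }
        return 0;
    }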


So, my question is, "what are the most efficient divide algorithms", given:

#1: modern x86-64 CPUs like FX-8350
#2: executing in 64-bit mode only (no 32-bit)
#3: implementation entirely in assembly-language
#4: 128-bit to 1024-bit integer operands (nominally signed, but...)

NOTE: My guess is, the actual implementation will operate only on unsigned integers. In the case of multiply, it turned out to be easier and more efficient (maybe) to convert negative operands to positive, then perform unsigned-multiply, then negate the result if exactly one original operand was negative. However, I'll consider a signed-integer algorithm if it is efficient.
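
A sketch of that wrapper pattern, shown on plain 64-bit values and assuming truncated (round-toward-zero) division like C's; the wide version would negate limb-wise with borrow:

    #include <stdint.h>

    /* Signed divide built on an unsigned divide, the same wrapper pattern as
     * for multiply: strip the signs, divide unsigned, then fix up the signs.
     * Truncated division: the quotient rounds toward zero and the remainder
     * takes the sign of the dividend.  (Edge cases such as INT64_MIN need
     * extra care in real code.) */
    static void sdiv_via_udiv(int64_t a, int64_t b, int64_t *q, int64_t *r)
    {
        uint64_t ua = (a < 0) ? 0 - (uint64_t)a : (uint64_t)a;
        uint64_t ub = (b < 0) ? 0 - (uint64_t)b : (uint64_t)b;
        uint64_t uq = ua / ub;
        uint64_t ur = ua % ub;
        *q = ((a < 0) != (b < 0)) ? -(int64_t)uq : (int64_t)uq;
        *r = (a < 0) ? -(int64_t)ur : (int64_t)ur;
    }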

NOTE: If the best answers are different for my floating-point types (f0128, f0256, f0512, f1024), please explain why.

NOTE: My internal core unsigned-multiply routine, which performs the multiply operation for all these integer data-types, produces a double-width result. In other words:

u0256 = u0128 * u0128     // cannot overflow
u0512 = u0256 * u0256     // cannot overflow
u1024 = u0512 * u0512     // cannot overflow
u2048 = u1024 * u1024     // cannot overflow

My code library offers two versions of multiply for each signed-integer data-type:

s0128 = s0128 * s0128     // can overflow (result not fit in s0128)
s0256 = s0256 * s0256     // can overflow (result not fit in s0256)
s0512 = s0512 * s0512     // can overflow (result not fit in s0512)
s1024 = s1024 * s1024     // can overflow (result not fit in s1024)

s0256 = s0128 * s0128     // cannot overflow
s0512 = s0256 * s0256     // cannot overflow
s1024 = s0512 * s0512     // cannot overflow
s2048 = s1024 * s1024     // cannot overflow

This is consistent with the policy of my code library to "never lose precision" and "never overflow" (errors are returned when the answer is invalid due to precision-loss or due to overflow/underflow). However, when double-width return value functions are called, no such errors can occur.


The solution

Surely you know about the existing arbitrary-precision packages (e.g., http://gmplib.org/) and how they operate? They are generally designed to run "as fast as possible" for arbitrary precisions.

If you specialized them for fixed sizes (e.g., manually applied partial-evaluation techniques to fold constants and unroll loops), I'd expect you to get pretty good routines for the specific fixed-size precisions you want.

Also, if you haven't seen it, check out D. Knuth's Seminumerical Algorithms, an oldie but really a goodie, which provides the key algorithms for multi-precision arithmetic. (Most of the packages are based on these ideas, but Knuth has great explanations and gets an awful lot right.)

The key idea is to treat multi-precision numbers as if they were very-big-radix numbers (e.g., radix 2^64) and apply standard 3rd-grade arithmetic to the "digits" (here, 64-bit words). Division consists of "estimate a quotient (big-radix) digit, multiply the estimate by the divisor, subtract from the dividend, shift left one digit, repeat" until you have enough digits to satisfy you. For division, yes, it's all unsigned (with sign handling done in wrappers). The basic tricks are estimating a quotient digit well (using the single-precision instructions the processor gives you) and doing fast multi-precision multiplies by single digits. See Knuth for details. See technical research papers on multi-precision arithmetic (you get to do some research) for exotic ("fastest possible") improvements.
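
As a minimal sketch of the big-radix idea (the easy case of a single-digit divisor, not the full multi-digit Algorithm D with its quotient-digit estimation and correction steps):

    #include <stdint.h>
    #include <stddef.h>

    typedef unsigned __int128 u128;

    /* Short division: n-limb dividend (little-endian limbs) by one 64-bit
     * "digit" d.  Each step divides a two-limb partial remainder by d, which
     * is exactly schoolbook long division in radix 2^64.  Returns the
     * remainder.  The 128/64 division here can be a single hardware DIV when
     * written in assembly, because the running remainder is always < d. */
    static uint64_t div_limbs_by_1(uint64_t *q, const uint64_t *x, size_t n,
                                   uint64_t d)
    {
        uint64_t r = 0;
        for (size_t i = n; i-- > 0; ) {
            u128 cur = ((u128)r << 64) | x[i];
            q[i] = (uint64_t)(cur / d);
            r    = (uint64_t)(cur % d);
        }
        return r;
    }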

Other tips

The "large-radix" approaches are more efficient for the kinds of huge data-types you mention, especially if you can execute 128-bit divided by 64-bit instructions in assembly-language.

While Newton-Raphson iteration does converge quickly, each iteration requires a huge number of multiply and accumulate steps.

For the multiplication, have a look here:

http://www.math.niu.edu/~rusin/known-math/99/karatsuba
(archived copy: http://web.archive.org/web/20141114071302/http://www.math.niu.edu/~rusin/known-math/99/karatsuba)

Basically, it allows doing a 1024 x 1024 multiplication using three (instead of four) 512 x 512 bit multiplications. Or nine 256 x 256 bit, or twenty-seven 128 x 128 bit multiplications. The added complexity might not beat brute force even at 1024 x 1024 bits, but probably would for bigger products. That's the simplest of the "fast" algorithms, using n ^ (log 3 / log 2) = n^1.585 multiplications.
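
As a self-contained sketch of the identity, here is a 128 x 128 -> 256 bit multiply in C built from three 64 x 64 -> 128 multiplies (the subtractive variant, which keeps the middle term from overflowing). At this size it will not beat four plain multiplies; it only shows the mechanics:

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Karatsuba 128x128 -> 256: x = x1*2^64 + x0, y = y1*2^64 + y0.
     * Three 64x64 multiplies: z2 = x1*y1, z0 = x0*y0, m = |x1-x0|*|y1-y0|.
     * The middle term x1*y0 + x0*y1 equals z2 + z0 - (x1-x0)*(y1-y0).
     * Result is a little-endian 4-limb array. */
    static void mul128_karatsuba(uint64_t x1, uint64_t x0,
                                 uint64_t y1, uint64_t y0, uint64_t out[4])
    {
        u128 z2 = (u128)x1 * y1;                    /* high partial product */
        u128 z0 = (u128)x0 * y0;                    /* low partial product  */

        uint64_t dx = (x1 >= x0) ? x1 - x0 : x0 - x1;
        uint64_t dy = (y1 >= y0) ? y1 - y0 : y0 - y1;
        int neg = (x1 >= x0) != (y1 >= y0);         /* sign of (x1-x0)*(y1-y0) */
        u128 m = (u128)dx * dy;

        /* The middle term needs up to 129 bits, so carry it explicitly. */
        u128 z1 = z2 + z0;
        uint64_t c = (z1 < z2);
        if (neg) { z1 += m; c += (z1 < m); }
        else     { c -= (z1 < m); z1 -= m; }

        /* Assemble out = z0 + z1*2^64 + z2*2^128 (c lands at bit 192). */
        out[0] = (uint64_t)z0;
        u128 t = (z0 >> 64) + (uint64_t)z1;
        out[1] = (uint64_t)t;
        t = (t >> 64) + (z1 >> 64) + (uint64_t)z2;
        out[2] = (uint64_t)t;
        t = (t >> 64) + (z2 >> 64) + c;
        out[3] = (uint64_t)t;
    }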

I'd advise not writing everything in assembler. Implement 64 x 64 -> 128 bit multiplication with inline assembler, and the same for add-with-carry (I think gcc and clang have built-in operations for this nowadays); then you can, for example, multiply n bits x 256 bits (any number of words times 4 words) in parallel, hiding the latency of the multiplications, without going mad with assembler.
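
For example (gcc/clang, using the unsigned __int128 extension; clang, for instance, also provides __builtin_addcll for the carry chain):

    #include <stdint.h>
    #include <stddef.h>

    typedef unsigned __int128 u128;

    /* 64x64 -> 128 multiply; compilers lower this to a single MUL/MULX. */
    static inline void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
    {
        u128 p = (u128)a * b;
        *hi = (uint64_t)(p >> 64);
        *lo = (uint64_t)p;
    }

    /* Add-with-carry on one limb; carry is 0 or 1 on entry and exit. */
    static inline uint64_t addc64(uint64_t a, uint64_t b, uint64_t *carry)
    {
        u128 s = (u128)a + b + *carry;
        *carry = (uint64_t)(s >> 64);
        return (uint64_t)s;
    }

    /* acc[0..n-1] += x[0..n-1] * d: the "n words times one word" building
     * block used to assemble wider products.  Returns the carry limb that
     * belongs at position n. */
    static uint64_t addmul_1(uint64_t *acc, const uint64_t *x, size_t n, uint64_t d)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            u128 p = (u128)x[i] * d + acc[i] + carry;   /* cannot overflow 128 bits */
            acc[i] = (uint64_t)p;
            carry  = (uint64_t)(p >> 64);
        }
        return carry;
    }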

For a large number of bits, I learned that the quickest algorithm goes like this: instead of dividing x / y, you calculate 1 / y and multiply by x. To calculate 1 / y:

1/y is the root t of f(t) = 1/(t*y) - 1 = 0.
Newton iteration: t' = t - f(t) / f'(t)
                     = t - (1/(t*y) - 1) / (-1/(t^2 * y))
                     = t + (t - t^2 * y)
                     = 2t - t^2 * y

The Newton iteration converges quadratically. Now the trick: if you want 1024-bit precision, you start with 32 bits; one iteration step gives 64 bits, the next gives 128 bits, then 256, then 512, then 1024. So you do many iterations, but only the last one runs at full precision. All in all, you do one 512 x 512 -> 1024 product (t^2), one 1024 x 1024 -> 1024 product (t^2 * y, completing the new estimate of 1 / y), and one more 1024 x 1024 product (x * (1 / y)).
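
To see the quadratic convergence (and why only the final step needs full precision), here is the same iteration shown with plain doubles for clarity; the big-integer version replaces the multiplies with truncated multiprecision products:

    #include <stdio.h>

    /* Newton iteration t' = t * (2 - t*y) = 2t - t^2*y for 1/y.
     * Each step roughly doubles the number of correct bits, so a 1024-bit
     * reciprocal only pays full-width multiply cost on the last step. */
    int main(void)
    {
        double y = 3.5;
        double t = 0.25;                       /* crude seed: a few correct bits */
        for (int i = 1; i <= 5; i++) {
            t = t * (2.0 - t * y);             /* one Newton step */
            printf("step %d: t = %.17g  error = %.3g\n", i, t, t - 1.0 / y);
        }
        return 0;
    }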

Of course you have to figure out very precisely what the error is after each iteration; you'll probably have to start with, say, 40 bits and lose a bit of precision in each step, so that you have enough at the end.

I have no idea at which point this would run faster than a straightforward brute-force division as you learned it at school. And y may have fewer than the full number of bits.

The alternative is brute force. You could take the highest 128 bits of x, divide them by the highest 64 bits of y, get the highest 64 bits r of the quotient, and then subtract r * y from x. Repeat as needed, carefully checking how big the errors are.

Divisions are slooow. So you calculate 2^127 / (highest 64 bits of y) once. Then, to estimate the next 64 bits, you multiply the highest 64 bits of x by this number and shift everything into the right place. Multiplication is much faster than division.

Next you'll find that all these operations have long latencies. For example, it may take 5 cycles to get a result, but you can start a multiplication every cycle. So: estimate 64 bits of the result. Start subtracting r * y at the high end of x, so you can estimate the next 64 bits as quickly as possible. Then subtract two or more products from x simultaneously, to hide the latency. Implementing this is tough. Some of it may not be worth it even for 1024 bits (which is just sixteen 64-bit integers).

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow