Question

Suppose that correctly rounded standard library functions, such as those found in CRlibm, are available. How would one then compute the correctly rounded cube root of a double-precision input?

This question is not an “actual problem that [I] face”, to quote the FAQ; it is a little like homework in that way. But the cube root is a frequently encountered operation, and one could imagine this question being an actual problem that someone faces.

Since “best Stack Overflow questions have a bit of source code in them”, here is a bit of source code:

  y = pow(x, 1. / 3.);

The above does not compute a correctly rounded cube root, because 1/3 is not exactly representable as a double.


ADDITIONAL NOTES:

An article describes how to compute a floating-point cube root, but the last iteration(s) of the recommended Newton-Raphson algorithm would have to be carried out in higher precision for the algorithm to produce a correctly rounded double-precision cube root. That may be the best way to compute it, but I am still looking for a shortcut that takes advantage of existing correctly rounded standardized functions.

C99 includes a cbrt() function, but it cannot be expected to be correctly rounded, or even faithfully rounded, for all compilers. CRlibm's designers could have chosen to include cbrt() in the list of provided functions, but they didn't. References to implementations available in other libraries of correctly rounded math functions are welcome.


Solution

I am afraid I do not know how to guarantee a correctly rounded double-precision cube root, but I can offer one that is very nearly correctly rounded, as also mentioned in the question: the maximum error appears to be very close to 0.5 ulp.

Peter Markstein, "IA-64 and Elementary Functions: Speed and Precision" (Prentice-Hall 2000)

presents efficient FMA-based techniques for correctly rounding the reciprocal, square root, and reciprocal square root, but it does not cover the cube root in this regard. In general, Markstein's approach requires a preliminary result accurate to within 1 ulp before the final rounding sequence. I do not have the mathematical wherewithal to extend his technique to the rounding of cube roots, but it seems to me that this should be possible in principle, as the challenge is somewhat similar to the reciprocal square root.

Bit-wise algorithms lend themselves easily to the computation of roots with correct rounding. Tie cases for the IEEE-754 rounding mode to-nearest-or-even cannot occur: a tie would mean the exact cube root is a 54-bit value, but the cube of a 54-bit significand requires about 160 significant bits and therefore cannot be a double-precision input. Consequently one needs to carry the computation only until it has produced all mantissa bits plus one round bit. The bit-wise algorithm for square root, based on the binomial theorem, is well known in both non-restoring and restoring variants and has been the basis of hardware implementations. The same approach via the binomial theorem works for the cube root, and there is a little-known paper that lays out the details of a non-restoring implementation:

H. Peng, “Algorithms for Extracting Square Roots and Cube Roots,” Proceedings 5th IEEE International Symposium on Computer Arithmetic, pp. 121-126, 1981.

Best I can tell from experimenting with it, this works well enough for the extraction of cube roots from integers. Since each iteration produces only a single result bit, it is not exactly fast. For applications in floating-point arithmetic it has the drawback of requiring a couple of bookkeeping variables with roughly twice the number of bits of the final result. This means one would need 128-bit integer arithmetic to implement a double-precision cube root.

My C99 code below is based on Halley's rational method for the cube root, which has cubic convergence, meaning that the initial approximation does not have to be very accurate, as the number of valid digits triples in every iteration. The computation can be arranged in various ways. Generally, it is numerically advantageous to arrange iterative schemes as

new_guess := old_guess + correction

since for a sufficiently close initial guess, correction is significantly smaller than old_guess. This leads to the following iteration scheme for the cube root:

x := x - x * (x³ - a) / (2·x³ + a)

This particular arrangement is also listed in Kahan's notes on the cube root. It has the further advantage of lending itself naturally to the use of FMA (fused multiply-add). One drawback is that the computation of 2·x³ could overflow, so an argument reduction scheme is required for at least part of the double-precision input domain. In my code, I simply apply the argument reduction to all non-exceptional inputs, based on straightforward manipulation of the exponents of IEEE-754 double-precision operands.

The interval [0.125,1) serves as the primary approximation interval. A polynomial minimax approximation is used that returns an initial guess in [0.5,1]. The narrow range facilitates the use of single-precision arithmetic for the low-accuracy portions of the computation.

I cannot prove anything about the error bounds of my implementation; however, testing with 200 million random test vectors against a reference implementation (accurate to about 200 bits) found a total of 277 incorrectly rounded results (an error rate of roughly 1.4 ppm), with a maximum error of 0.500012 ulp.

#include <math.h>

double my_cbrt (double a)
{
    double b, u, v, r;
    float bb, uu, vv;
    int e, f, s;

    if ((a == 0.0) || isinf(a) || isnan(a)) {
        /* handle special cases */
        r = a + a;
    } else {
        /* strip off sign-bit */
        b = fabs (a);
        /* compute exponent adjustments */
        b = frexp (b, &e);
        s = e - 3*342;
        f = s / 3;
        s = s - 3 * f;
        f = f + 342;
        /* map argument into the primary approximation interval [0.125,1) */
        b = ldexp (b, s);
        bb = (float)b;
        /* approximate cube root in [0.125,1) with relative error 5.22e-3 */
        uu =                0x1.2f32c0p-1f;
        uu = fmaf (uu, bb, -0x1.62cc2ap+0f);
        uu = fmaf (uu, bb,  0x1.7546e0p+0f);
        uu = fmaf (uu, bb,  0x1.5d0590p-2f);
        /* refine cube root using two Halley iterations w/ cubic convergence */
        vv = uu * uu;
        uu = fmaf (fmaf (vv, uu, -bb) / fmaf (vv, 2.0f*uu, bb), -uu, uu);
        u = (double)uu;
        v = u * u; // this product is exact
        r = fma (fma (v, u, -b) / fma (v, 2.0*u, b), -u, u);
        /* map back from primary approximation interval by jamming exponent */
        r = ldexp (r, f);
        /* restore sign bit */
        r = copysign (r, a);
    }
    return r;
}

OTHER TIPS

Given that there are lots of easily computable rational points on the curve x = y^3, I’m tempted to reduce about a nearby point s^3 ~ x, with s rational and just a few bits wide. Then you have:

cbrt(x) = s * cbrt(1 + (x - s^3)/s^3)

The obvious thing then is to evaluate the correction term using your favorite series approximation, and to compute a residual via head-tail FMA arithmetic to bump the result up or down by an ulp if needed (you won’t need the full computation most of the time, obviously).

This isn’t totally in the spirit of the question, but can definitely be made to work, and it’s very easy to prove the necessary bounds this way. Hopefully someone else can suggest something more clever (I used up my cleverness for the month already).

I have recently written a correctly-rounded cube root, implemented in https://github.com/mockingbirdnest/Principia/blob/master/numerics/cbrt.cpp and documented in https://github.com/mockingbirdnest/Principia/blob/master/documentation/cbrt.pdf. That implementation is in C++ and uses Intel intrinsics, but it should be straightforward to adapt that to the language and libraries of one’s choice.

Brief summary of the actual computation below; the references here use bibliographic codes from that document because I am apparently too disreputable to link many things at this juncture. References to “part I+” or “appendix [A-Z]” are to cbrt.pdf.

Correct rounding.

This can be done by using a nearly-correct faithful method and, if the result is too close to a tie, following up with one round of the classical digit-by-digit algorithm (the one that resembles long division) to obtain the 54th bit. Since that algorithm computes the bits of the result rounded toward 0, and since there are no halfway cases for the cube root, this bit suffices to determine the rounding to nearest.

It is possible to get the nearly-correct method to be both faster than most existing implementations of the cube root, and correct enough that the average cost is essentially unaffected by the correction step; see appendices D and F.

Remarks on the faithful method without FMA.

The “nearly-correct faithful method” part is perhaps the more interesting; without FMA the basic outline is the same as in Kahan’s [KB01]:

  1. integer arithmetic on the representation of the given floating-point number;
  2. root-finding to a third of the precision;
  3. rounding to a third of the precision, to get an exact cube;
  4. one step of a high-order method using that exact cube.

Using a clever root-finding method in step 2, one can lower the misrounding rate while retaining or improving performance. In this case I dug up an irrational method,

𝑎 ↦ ½𝑎 + √(¼𝑎² + 𝑏/(3𝑎)) to approximate ∛(𝑎³+𝑏),

from a 17th-century book by Thomas Fantet de Lagny [Fan92]. Perhaps surprisingly, given that it has both a division and a square root, its performance when rewritten appropriately (so as to schedule the division as early as possible and to avoid a serial dependency between the square root and the division; see appendix D) is similar to that of the better-known rational method (nowadays often known as Halley’s, but both methods are considered by Halley, and both are due to Lagny when applied to the cube root; see the discussion in part I). This is because the rational method has a divisor of higher degree, so that its division cannot be scheduled early. The error of the irrational method is half that of the rational method.

Optimizing the coefficients to minimize the error at the end of step 2, in a manner inspired by tweets by Stephen Canon (two twitter threads, [Can18a] and [Can18b]) gains another two bits on the error of that irrational method, at no cost.

With a rational method of order 5 in step 4, the method achieves a misrounding rate of 4.564(68) per million, which is easily corrected at essentially no average cost (the rate of passage through the slow “potential misrounding” path is 2.6480(52)·10⁻⁴, and the latency of the slow path is less than ten times that of the fast path).

Remarks on the faithful method with FMA.

The key idea is in njuffa’s answer to this question: step 4 can be performed with only an exact square, without requiring an exact cube, so that step 2 should aim for, and step 3 should round to, half of the precision rather than a third.

Contrary to njuffa, I used double throughout. In step 2, I went for an irrational method of order 5 (obtained by generalizing Lagny’s methods; see part I and appendix B), and in step 4, for a rational method of order 4 rather than 3.

The resulting method has a misrounding rate of 6.10(25) per billion, and a rate of passage through the “potential misrounding” path of 3.05(18)·10⁻⁷.

Licensed under: CC-BY-SA with attribution