Problem

In my code I often compute things like the following piece (C code here for simplicity):

float cos_theta = /* some simple operations; no cosf call! */;
float sin_theta = sqrtf(1.0f - cos_theta * cos_theta); // Option 1

For this example, ignore that the argument of the square root might become negative due to imprecision; I fixed that with an additional fdimf call (a rough sketch follows at the end of this question). However, I wondered whether the following is more precise:

float sin_theta = sqrtf((1.0f + cos_theta) * (1.0f - cos_theta)); // Option 2

cos_theta is between -1 and +1, so for each choice there will be situations where I subtract similar numbers and thus lose precision, right? Which option is the most precise, and why?
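For completeness, the fdimf guard I mentioned (fdimf(x, y) returns x - y clamped to zero, declared in <math.h>) looks roughly like this:

float sin_theta = sqrtf(fdimf(1.0f, cos_theta * cos_theta)); // Option 1 with the argument clamped at 0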


Solution

The most precise way with floats is likely to compute both sin and cos using a single x87 instruction, fsincos.

However, if you need to do the computation manually, it's best to group arguments with similar magnitudes. This means the second option is more precise, especially when cos_theta is close to 0, where precision matters the most.

As the article What Every Computer Scientist Should Know About Floating-Point Arithmetic notes:

The expression x² - y² is another formula that exhibits catastrophic cancellation. It is more accurate to evaluate it as (x - y)(x + y).

Edit: it's more complicated than this. Although the above is generally true, (x - y)(x + y) is slightly less accurate when x and y are of very different magnitudes, as the footnote to the statement explains:

In this case, (x - y)(x + y) has three rounding errors, but x² - y² has only two, since the rounding error committed when computing the smaller of x² and y² does not affect the final subtraction.

In other words, computing x - y, x + y, and the product (x - y)(x + y) introduces a rounding error at each step (three roundings in total). Computing x², y², and the subtraction x² - y² also introduces a rounding error at each step, but the error from squaring the relatively small number (the smaller of x and y) is so negligible that there are effectively only two roundings, making the difference of squares more precise.

So option 1 is actually going to be more precise. This is confirmed by dev.brutus's Java test.
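To make the rounding-error discussion concrete, here is a minimal C sketch of my own (x and y chosen arbitrarily, with very different magnitudes as in the footnote's case) that evaluates both forms in single precision against a double-precision reference:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 1.0f, y = 1e-4f;   /* arbitrary values of very different magnitudes */
    double reference = (double)x * x - (double)y * y;   /* reference computed in double */

    float diff_of_squares = x * x - y * y;        /* form 1: x^2 - y^2 in float */
    float product_form    = (x - y) * (x + y);    /* form 2: (x - y)(x + y) in float */

    printf("reference            = %.10e\n", reference);
    printf("x*x - y*y       err  = %.3e\n", fabs(diff_of_squares - reference));
    printf("(x - y)*(x + y) err  = %.3e\n", fabs(product_form - reference));
    return 0;
}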

Other tips

I wrote a small test. It calculates the expected value with double precision and then measures the error of each of your options. The first option is better:

Algorithm: FloatTest$1
option 1 error = 3.802792362162126
option 2 error = 4.333273185303996
Algorithm: FloatTest$2
option 1 error = 3.802792362167937
option 2 error = 4.333273185305868

The Java code:

import org.junit.Test;

public class FloatTest {

    @Test
    public void test() {
        testImpl(new ExpectedAlgorithm() {
            public double te(double cos_theta) {
                return Math.sqrt(1.0f - cos_theta * cos_theta);
            }
        });
        testImpl(new ExpectedAlgorithm() {
            public double te(double cos_theta) {
                return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
            }
        });
    }

    // Sums the absolute error of both float options over a sweep of cos_theta,
    // comparing against the double-precision reference supplied by ea.
    public void testImpl(ExpectedAlgorithm ea) {
        double delta1 = 0;
        double delta2 = 0;
        for (double cos_theta = -1; cos_theta <= 1; cos_theta += 1e-8) {
            double[] delta = delta(cos_theta, ea);
            delta1 += delta[0];
            delta2 += delta[1];
        }

        System.out.println("Algorithm: " + ea.getClass().getName());
        System.out.println("option 1 error = " + delta1);
        System.out.println("option 2 error = " + delta2);
    }

    private double[] delta(double cos_theta, ExpectedAlgorithm ea) {
        double expected = ea.te(cos_theta);
        double delta1 = Math.abs(expected - t1((float) cos_theta));
        double delta2 = Math.abs(expected - t2((float) cos_theta));

        return new double[]{delta1, delta2};
    }

    // Option 1 in float: sqrt(1 - cos_theta^2)
    private double t1(float cos_theta) {
        return Math.sqrt(1.0f - cos_theta * cos_theta);
    }

    // Option 2 in float: sqrt((1 + cos_theta) * (1 - cos_theta))
    private double t2(float cos_theta) {
        return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
    }

    interface ExpectedAlgorithm {
        double te(double cos_theta);
    }

}

The correct way to reason about numerical precision of some expression is to:

  1. Measure the result discrepancy relative to the correct value in ULPs (units in the last place), a concept introduced in 1960 by W. H. Kahan. You can find C, Python & Mathematica implementations here, and learn more on the topic here; a minimal C sketch for the float case also follows this list.
  2. Discriminate between two or more expressions based on the worst case they produce, not on the average absolute error (as done in other answers) or some other arbitrary metric. This is how numerical approximation polynomials are constructed (the Remez algorithm), how standard library implementations are analysed (e.g. Intel's atan2), etc.
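As a minimal illustration for the float case (my own sketch, using nextafterf to get the ULP of a float, with a double result standing in for the exact reference):

#include <math.h>
#include <stdio.h>

/* ULP of a positive finite float: distance to the next representable float above it. */
static float ulpf(float v)
{
    return nextafterf(v, INFINITY) - v;
}

int main(void)
{
    float x = 0.99999f;                                   /* arbitrary value close to 1 */
    double correct = sqrt(1.0 - (double)x * (double)x);   /* double result as the reference */

    float v1 = sqrtf(1.0f - x * x);                       /* version_1 */
    float v2 = sqrtf((1.0f - x) * (1.0f + x));            /* version_2 */

    printf("version_1 error = %.2f ULPs\n", fabs(v1 - correct) / ulpf((float)correct));
    printf("version_2 error = %.2f ULPs\n", fabs(v2 - correct) / ulpf((float)correct));
    return 0;
}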

With that in mind, version_1: sqrt(1 - x * x) and version_2: sqrt((1 - x) * (1 + x)) produce significantly different outcomes. As presented in the plot below, version_1 demonstrates catastrophic behaviour for x close to 1, with errors greater than 1,000,000 ULPs, while the error of version_2 is well behaved.

That is why I always recommend using version_2, i.e. exploiting the difference-of-squares formula.

[Log-log plot of error in ULPs vs. 1-x for version_1 and version_2]

Python 3.6 code that produces the square_diff_error.csv file:

from fractions import Fraction
from math import exp, fabs, sqrt
from random import random
from struct import pack, unpack


def ulp(x):
    """
    Computing ULP of input double precision number x exploiting
    lexicographic ordering property of positive IEEE-754 numbers.

    The implementation correctly handles the special cases:
      - ulp(NaN) = NaN
      - ulp(-Inf) = Inf
      - ulp(Inf) = Inf

    Author: Hrvoje Abraham
    Date: 11.12.2015
    Revisions: 15.08.2017
               26.11.2017
    MIT License https://opensource.org/licenses/MIT

    :param x: (float) float ULP will be calculated for
    :returns: (float) the input float number ULP value
    """

    # setting sign bit to 0, e.g. -0.0 becomes 0.0
    t = abs(x)

    # converting IEEE-754 64-bit format bit content to unsigned integer
    ll = unpack('Q', pack('d', t))[0]

    # computing the adjacent smaller integer (the larger one in the case ll = 0, i.e. t = 0.0)
    near_ll = abs(ll - 1)

    # converting back to float; its value will be the float nearest to t
    near_t = unpack('d', pack('Q', near_ll))[0]

    # abs takes care of case t=0.0
    return abs(t - near_t)


with open('e:/square_diff_error.csv', 'w') as f:
    for _ in range(100_000):
        # nonlinear distribution of x in [0, 1] to produce more cases close to 1
        k = 10
        x = (exp(k) - exp(k * random())) / (exp(k) - 1)

        fx = Fraction(x)
        correct = sqrt(float(Fraction(1) - fx * fx))

        version1 = sqrt(1.0 - x * x)
        version2 = sqrt((1.0 - x) * (1.0 + x))

        err1 = fabs(version1 - correct) / ulp(correct)
        err2 = fabs(version2 - correct) / ulp(correct)

        f.write(f'{x},{err1},{err2}\n')

Mathematica code that produces the final plot:

data = Import["e:/square_diff_error.csv"];

err1 = {1 - #[[1]], #[[2]]} & /@ data;
err2 = {1 - #[[1]], #[[3]]} & /@ data;

ListLogLogPlot[{err1, err2}, PlotRange -> All, Axes -> False, Frame -> True,
    FrameLabel -> {"1-x", "error [ULPs]"}, LabelStyle -> {FontSize -> 20}]

As an aside, you will always have a problem when theta is small, because the cosine is flat around theta = 0. If theta is between -0.0001 and 0.0001 then cos(theta) in float is exactly one, so your sin_theta will be exactly zero.
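A tiny C illustration of that flat region (assuming a correctly rounded cosf; the exact cutoff depends on the libm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float theta = 0.0001f;
    float cos_theta = cosf(theta);   /* with correct rounding this is exactly 1.0f */
    printf("cosf(0.0001f) == 1.0f ? %d\n", cos_theta == 1.0f);
    printf("derived sin_theta = %g (true value is about 0.0001)\n",
           sqrtf(1.0f - cos_theta * cos_theta));
    return 0;
}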

To answer your question, when cos_theta is close to one (corresponding to a small theta), your second computation is clearly more accurate. This is shown by the following program, which lists the absolute and relative errors of both computations for various values of cos_theta. The errors are computed by comparing against a value computed with 200 bits of precision using the GNU MP library, and then converted to a float.

#include <math.h>
#include <stdio.h>
#include <gmp.h>

int main() 
{
  int i;
  printf("cos_theta       abs (1)    rel (1)       abs (2)    rel (2)\n\n");
  for (i = -14; i < 0; ++i) {
    float x = 1 - pow(10, i/2.0);
    float approx1 = sqrt(1 - x * x);
    float approx2 = sqrt((1 - x) * (1 + x));

    /* Use GNU MultiPrecision Library to get 'exact' answer */
    mpf_t tmp1, tmp2;
    mpf_init2(tmp1, 200);  /* use 200 bits precision */
    mpf_init2(tmp2, 200);
    mpf_set_d(tmp1, x);
    mpf_mul(tmp2, tmp1, tmp1);  /* tmp2 = x * x */
    mpf_neg(tmp1, tmp2);        /* tmp1 = -x * x */
    mpf_add_ui(tmp2, tmp1, 1);  /* tmp2 = 1 - x * x */
    mpf_sqrt(tmp1, tmp2);       /* tmp1 = sqrt(1 - x * x) */
    float exact = mpf_get_d(tmp1);

    printf("%.8f     %.3e  %.3e     %.3e  %.3e\n", x,
           fabs(approx1 - exact), fabs((approx1 - exact) / exact),
           fabs(approx2 - exact), fabs((approx2 - exact) / exact));
    /* printf("%.10f  %.8f  %.8f  %.8f\n", x, exact, approx1, approx2); */
  }
  return 0;
}

Output:

cos_theta       abs (1)    rel (1)       abs (2)    rel (2)

0.99999988     2.910e-11  5.960e-08     0.000e+00  0.000e+00
0.99999970     5.821e-11  7.539e-08     0.000e+00  0.000e+00
0.99999899     3.492e-10  2.453e-07     1.164e-10  8.178e-08
0.99999684     2.095e-09  8.337e-07     0.000e+00  0.000e+00
0.99998999     1.118e-08  2.497e-06     0.000e+00  0.000e+00
0.99996835     6.240e-08  7.843e-06     9.313e-10  1.171e-07
0.99989998     3.530e-07  2.496e-05     0.000e+00  0.000e+00
0.99968380     3.818e-07  1.519e-05     0.000e+00  0.000e+00
0.99900001     1.490e-07  3.333e-06     0.000e+00  0.000e+00
0.99683774     8.941e-08  1.125e-06     7.451e-09  9.376e-08
0.99000001     5.960e-08  4.225e-07     0.000e+00  0.000e+00
0.96837723     1.490e-08  5.973e-08     0.000e+00  0.000e+00
0.89999998     2.980e-08  6.837e-08     0.000e+00  0.000e+00
0.68377221     5.960e-08  8.168e-08     5.960e-08  8.168e-08

When cos_theta is not close to one, the accuracy of both methods is very close to each other and to the round-off error.

[Edited for major think-o] It looks to me like option 2 will be better, because for a cos_theta like 0.000001, option 1 will return the sine as exactly 1, while option 2 will return a number just smaller than 1.

No difference in my opinion, since (1 - x) preserves the precision without affecting the carried bit, and the same is true for (1 + x). Then the only thing affecting the carry-bit precision is the multiplication. Since both cases involve a single multiplication, they are equally likely to give the same carry-bit error.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow