Consider the following functions:

#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>

template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
    return (std::sqrt(std::abs(2*b1-z0))*dx)+a0;
}

template <typename Type>
inline Type b(const Type dx, const Type a0, const Type z0, const Type a1)
{
    return (std::pow((a1-a0)/dx, 2)+ z0)/2;
}

int main(int argc, char* argv[])
{
    double dx = 1.E-6;
    double a0 = 1;
    double a1 = 2;
    double z0 = -1.E7;
    double b1 = -10;
    std::cout<<std::scientific;
    std::cout<<std::setprecision(std::numeric_limits<double>::digits10);
    std::cout<<a1-a(dx, a0, z0, b(dx, a0, z0, a1))<<std::endl;
    std::cout<<b1-b(dx, a0, z0, a(dx, a0, z0, b1))<<std::endl;
    return 0;
}

On my machine, it returns:

0.000000000000000e+00
-1.806765794754028e-07

Instead of (0, 0). There is a large rounding error with the second expression.

My question is: how to reduce the rounding error of each function without changing the type (I need to keep these 2 functions declarations (but the formulas can be rearanged): they come from a larger program)?

有帮助吗?

解决方案

Sadly, all of the floating point types are notorious for rounding error. They can't even store 0.1 without it (you can prove this using long division by hand: the binary equivalent is 0b0.0001100110011001100...). You might try some workarounds like expanding that pow to a hard-coded multiplication, but you'll ultimately need to code your program to anticipate and minimize the effects of rounding error. Here are a couple ideas:

  • Never compare floating point values for equality. Some alternative comparisons I have seen include: abs(a-b) < delta, or percent_difference (a,b) < delta or even abs(a/b-1) < delta, where delta is a "suitably small" value you have determined works for this specific test.

  • Avoid adding long arrays of numbers into an accumulator; the end of the array may be completely lost to rounding error as the accumulator grows large. In "Cuda by Example" by Jason Sanders and Edward Kandrot, the authors recommend recursively adding each pair of elements individually so that each step produces an array half the size of the previous step, until you get a one-element array.

其他提示

In a(), you lose precision when you add a0 (which is exactly 1) to the small and imprecise result of sqrt()*dx.

The function b() doesn't lose any precision using the supplied values.

When you call a() before b() as in the second output, you're doing mathematical operations on a number that's already imprecise, compounding the error.

Try to structure the mathematical operations so you do operations that are less likely to create floating point errors first and those more likely to create floating point errors last.

Or, inside your functions, make sure they are operating on "long double" values. For example, the following uses floating-point promotion to promote double to long double during the first mathematical operation (pay attention to operator precedence):

template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
    return (std::sqrt(std::abs(2*static_cast<long double>(b1)-z0))*dx)+a0;
}
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top