I wrote two code files in MIPS assembly for the expression below:

R(n) = SUM_{i=1}^{n} [ (i+2)/(i+1 - 1/i) - i/(i + 1/i) ]

The first program computes the whole expression R(n) as a single summation and prints the result.

The second program first sums the first term, i.e., (i+2)/(i+1 - 1/i), in one loop, then sums the second term, i.e., i/(i + 1/i), in another loop. It then simply subtracts the two summations.

Following are the results for the two programs for different values of n:

Program 1:

 N        Result
-------------------
10        5.07170725
100       7.41927338
1000      9.72636795
10000    12.02908134
100000   14.33149338
1000000  16.63462067

Program 2:

 N        Result
-------------------
10        5.07170773
100       7.41923523
1000      9.72259521
10000    12.31250000
100000    8.61718750
1000000   6.50000000

Program 1 gives more accurate results (compared with Wolfram Alpha's values for R(n)). Why does Program 2 give such odd results for large values of n? My question concerns floating-point precision.

Note: I am using single-precision numbers.


Solution

Say you have u_n = a_n - b_n and you want sum(u_n).

a_n -> 1 as n -> infinity, so the sum of P terms tends to P + cte_a; the same holds for b_n, whose sum tends to P + cte_b.

When you subtract the two, (P + cte_a) - (P + cte_b), you should mathematically recover sum(u_n).

But with floating point, that's not what happens, because (P + cte_a) is rounded to the nearest float. And the bigger P is, the further float(P + cte_a) - float(P) will be from cte_a...

To convince yourself, try evaluating these operations:

10.0f + 0.1f - 10.0f
100.0f + 0.1f - 100.0f
...
1.0e7f + 0.1f - 1.0e7f

u_n ~ 1/n as n -> infinity, so the terms Program 1 accumulates stay small, and it does a bit better...

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow