Double division using two floats?

https://stackoverflow.com/questions/4097067

28-09-2019

Question

I would like to do a double division using two floats (it appears that DirectCompute does not support double division).

Is that possible?

This is what I tried so far (C# code, should be HLSL later):

int count = 7;
double value = 0.0073812398871474;
float f1 = (float)value;
float f2 = (float)((value - f1));
float r1 = f1 / count;
float r2 = f2 / count;
double result = (double)r1 + (double)r2;

0.00105446285765182 (result)

0.00105446284102106 (correct result)

It has to do with the rounding in f1. If value is instead:

 double value = 0.0073812344471474;

Then the result is correct.

Solution

Calculate the reciprocal of count with a float division, then improve its precision to full double using the Newton-Raphson reciprocal formula.

int count = 7;
double value = 0.0073812398871474;
double r = (double) (1.0f / count); // approximate reciprocal
r = r * (2.0 - count*r); // much better approximation
r = r * (2.0 - count*r); // should be full double precision by now.
double result = value * r;
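
A rough sketch of why two refinement steps are enough: if r = (1 - e) / count for some small relative error e, then r * (2 - count * r) = (1 - e*e) / count, so each step squares the relative error. A float reciprocal starts with e around 2^-24, which gives about 2^-48 after one step and is below double precision after two. The snippet below wraps the code above in a hypothetical helper, DivideViaReciprocal (the name is not from the original answer), and compares it with ordinary double division. It assumes that double multiplication and subtraction are available and that only double division is missing.

```csharp
using System;

static class ReciprocalDemo
{
    // Hypothetical helper: divides value by count without a double division.
    public static double DivideViaReciprocal(double value, int count)
    {
        double r = (double)(1.0f / count); // float estimate, about 24 good bits
        r = r * (2.0 - count * r);         // about 48 good bits
        r = r * (2.0 - count * r);         // now limited only by double rounding
        return value * r;
    }

    static void Main()
    {
        double value = 0.0073812398871474;
        int count = 7;

        double viaReciprocal = DivideViaReciprocal(value, count);
        double reference = value / count; // only used here for comparison

        Console.WriteLine(viaReciprocal.ToString("R"));
        Console.WriteLine(reference.ToString("R"));
        // Relative difference between the two results.
        Console.WriteLine(Math.Abs(viaReciprocal - reference) / reference);
    }
}
```

Because r and the final product are each rounded, the result may still differ from value / count in the last bit or so; this approach is not guaranteed to be correctly rounded.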

Other tips

Apparently your arithmetic error is not immediately clear to you. Let me spell it out.

Suppose a double has two parts, the big part and the little part, each with roughly 32 bits of precision. (This is not exactly how doubles work but it will do for our purposes.)

A float only has one part.

Imagine we were doing it 32 bits at a time but keeping everything in doubles:

double divisor = whatever;
double dividend = dividendbig + dividendlittle;
double bigquotient = dividendbig / divisor;

What is bigquotient? It's a double. So it has two parts. bigquotient is equal to bigquotientbig + bigquotientlittle. Continuing on:

double littlequotient = dividendlittle / divisor;

Again, littlequotient is littlequotientbig + littlequotientlittle. Now we add the quotients:

double quotient = bigquotient + littlequotient;

How do we compute that? quotient has two parts. quotientbig will be set to bigquotientbig. quotientlittle will be set to bigquotientlittle + littlequotientbig. littlequotientlittle gets discarded.

Now suppose you do it in floats. You have:

float f1 = dividendbig;
float f2 = dividendlittle;
float r1 = f1 / divisor;

OK, what is r1? It's a float. So it only has one part. r1 is bigquotientbig.

float r2 = f2 / divisor;

What is r2? It's a float. So it only has one part. r2 is littlequotientbig.

double result = (double)r1 + (double)r2;

You add them together and you get bigquotientbig + littlequotientbig. What happened to bigquotientlittle? You've lost 32 bits of precision in there, and so it should come as no surprise that you get inaccuracies 32 bits along the way. You have not come up with the right algorithm at all for approximating 64-bit arithmetic in 32 bits.

In order to compute (big + little)/divisor, you can't simply do (big / divisor) + (little / divisor). That rule of algebra does not apply when you are rounding during every division!
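
To make the lost part concrete, the sketch below reruns the question's numbers and measures, in double, what the first float division threw away. The class and method names are illustrative only and not from the original answer. The explicit (float) casts are an assumption about intent: they force single-precision rounding regardless of how the runtime evaluates intermediate results.

```csharp
using System;

static class LostBitsDemo
{
    // The two-float quotient exactly as attempted in the question.
    public static double TwoFloatQuotient(double value, int count)
    {
        float f1 = (float)value;          // dividendbig
        float f2 = (float)(value - f1);   // dividendlittle
        float r1 = (float)(f1 / count);   // keeps only bigquotientbig
        float r2 = (float)(f2 / count);   // keeps only littlequotientbig
        return (double)r1 + (double)r2;
    }

    // bigquotientlittle: the part of f1 / count that rounding r1 to float discarded.
    public static double LostInFirstDivision(double value, int count)
    {
        float f1 = (float)value;
        float r1 = (float)(f1 / count);
        return (double)f1 / count - (double)r1;
    }

    static void Main()
    {
        double value = 0.0073812398871474;
        int count = 7;

        double twoFloat = TwoFloatQuotient(value, count);
        double lost = LostInFirstDivision(value, count);
        double reference = value / count;

        Console.WriteLine((twoFloat - reference).ToString("R")); // error of the two-float result
        Console.WriteLine(lost.ToString("R"));                   // discarded bigquotientlittle
        Console.WriteLine((twoFloat + lost).ToString("R"));      // close to the reference again
        Console.WriteLine(reference.ToString("R"));
    }
}
```

Adding the discarded term back brings the result very close to value / count, which suggests that nearly all of the error comes from rounding r1 rather than from splitting the dividend.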

Is that now clear?

Is that possible?

Yes, as long as you:

Accept the inevitable loss of precision
Bear in mind that not all doubles fit into floats in the first place

Update

After reading your comments (double precision is a requirement), my updated answer is:

No.

So how about something like

result = value * (double)(1f / (float)count); ?

There you're only dividing two floats. I have more casts in there than needed, but it's the concept that counts.

Edit:
Okay, so you're worried about the difference between the actual value and the rounded one, right? So just do it over and over until you get it right!

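double result = 0;
double difference = value;  // part of value not yet divided
double total = 0;           // sum of the float pieces taken so far
float f1 = 0;
while (difference != 0)
{
    f1 = (float)difference;          // next float-sized piece of value
    total += f1;
    difference = value - total;      // what the pieces still miss
    result += (double)(f1 / count);  // float division, accumulated in double
}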

...but you know, the easy answer still is "No". This still doesn't even catch ALL the rounding errors. From my tests it lowers the inaccuracies to 1e-17 at the most, about 30% of the time.

In a comment, you say:

"Of course there should not be any loss of precision. This is why I'm using two floats. If I would accept loss of precision, then I could just cast to float and do the division."

An IEEE-754 single precision value has 24 significant binary digits. A double precision value has 53 significant binary digits, and two single precision values carry only about 48 between them. In general you can't even represent a double precision value as two single precision values without loss of accuracy, much less do arithmetic with such a representation.
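
One way to see the representation limit is to split a double into a float "head" plus a float "tail" and check what is left over. The value below is chosen purely for illustration and is not from the original answer: 2^52 + 2^26 + 1 fits exactly in a double, but its set bits span 53 positions, more than two 24-bit significands can cover. SplitResidual is a hypothetical helper name.

```csharp
using System;

static class SplitDemo
{
    // Returns what remains of value after removing a float head and a float tail.
    public static double SplitResidual(double value)
    {
        float hi = (float)value;                 // nearest float to value
        float lo = (float)(value - (double)hi);  // nearest float to the remainder
        return value - (double)hi - (double)lo;  // nonzero means the pair lost bits
    }

    static void Main()
    {
        // 2^52 + 2^26 + 1 is below 2^53, so the conversion to double is exact.
        double value = (double)((1L << 52) + (1L << 26) + 1L);

        // hi becomes 2^52 and lo becomes 2^26, so the final +1 is lost.
        Console.WriteLine(SplitResidual(value)); // prints 1
    }
}
```

A residual of 0 is still possible for values whose bits happen to fit, so this shows that the limit exists, not that every double is affected.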

That said, it is possible to do a correctly rounded double precision division using only conversions between double and single, double precision subtraction/addition, and single precision operations, but it's pretty complicated if you really want to do it right. Do you need actual IEEE-754 correct rounding, or just an answer that's correct up to the last bit or two?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow