Can every float be expressed exactly as a double?

https://stackoverflow.com/questions/259015

06-07-2019
|

Question

Can every possible value of a float variable can be represented exactly in a double variable?

In other words, for all possible values X will the following be successful:

float f1 = X;
double d = f1;
float f2 = (float)d;

if(f1 == f2)
  System.out.println("Success!");
else
  System.out.println("Failure!");

My suspicion is that there is no exception, or if there is it is only for an edge case (like +/- infinity or NaN).

Edit: Original wording of question was confusing (stated two ways, one which would be answered "no" the other would be answered "yes" for the same answer). I've reworded it so that it matches the question title.

Solution

Yes.

Proof by enumeration of all possible cases:

public class TestDoubleFloat  {
    public static void main(String[] args) {
        for (long i = Integer.MIN_VALUE; i <= Integer.MAX_VALUE; i++) {
            float f1 = Float.intBitsToFloat((int) i);
            double d = (double) f1;
            float f2 = (float) d;
            if (f1 != f2) {
                if (Float.isNaN(f1) && Float.isNaN(f2)) {
                    continue; // ok, NaN
                }
                fail("oops: " + f1 + " != " + f2);
            }
        }
    }
}

finishes in 12 seconds on my machine. 32 bits are small.

OTHER TIPS

In theory, there is not such a value, so "yes", every float should be representable as a double.. Converting from a float to a double should involve just tacking four bytes of 00 on the end -- they are stored using the same format, just with different sized fields.

Yes, floats are a subset of doubles. Both floats and doubles have the form (sign * a * 2^b). The difference between floats and doubles is the number of bits in a & b. Since doubles have more bits available, assigning a float value to a double effectively means inserting extra 0 bits.

As everyone has already said, "no". But that's actually a "yes" to the question itself, i.e. every float can be exactly expressed as a double. Confusing. :)

If I'm reading the language specification correctly (and as everyone else is confirming), there is no such value.

That is, each claims only to hold only IEEE 754 standard values, so casts between the two should incur no change except in memory given.

(clarification: There would be no change as long as the value was small enough to be held in a float; obviously if the value was too many bits to be held in a float to begin with, casting from double to float would result in a loss of precision.)

@KenG: This code:

float a = 0.1F
println "a=${a}"
double d = a
println "d=${d}"

fails not because 0.1f can't be exactly represented. The question was "is there a float value that cannot be represented as a double", which this code doesn't prove. Although 0.1f can't be stored exactly, the value that a is given (which isn't 0.1f exactly) can be stored as a double (which also won't be 0.1f exactly). Assuming an Intel FPU, the bit pattern for a is:

0 01111011 10011001100110011001101

and the bit pattern for d is:

0 01111111011 100110011001100110011010 (followed by lots more zeros)

which has the same sign, exponent (-4 in both cases) and the same fractional part (separated by spaces above). The difference in the output is due to the position of the second non-zero digit in the number (the first is the 1 after the point) which can only be represented with a double. The code that outputs the string format stores intermediate values in memory and is specific to floats and doubles (i.e. there is a function double-to-string and another float-to-string). If the to-string function was optimised to use the FPU stack to store the intermediate results of the to-string process, the output would be the same for float and double since the FPU uses the same, larger format (80bits) for both float and double.

There are no float values that can't be stored identically in a double, i.e. the set of float values is a sub-set of the the set of double values.

Snark: NaNs will compare differently after (or indeed before) conversion.

This does not, however, invalidate the answers already given.

I took the code you listed and decided to try it in C++ since I thought it might execute a little faster and it is significantly easier to do unsafe casting. :-D

I found out that for valid numbers, the conversion works and you get the exact bitwise representation after the cast. However, for non-numbers, e.g. 1.#QNAN0, etc., the result will use a simplified representation of the non-number rather than the exact bits of the source. For example:

**** FAILURE **** 2140188725 | 1.#QNAN0 -- 0xa0000000 0x7ffa1606

I cast an unsigned int to float then to double and back to float. The number 2140188725 (0x7F90B035) results in a NAN and converting to double and back is still a NAN but not the exact same NAN.

Here is the simple C++ code:

typedef unsigned int uint;
for (uint i = 0; i < 0xFFFFFFFF; ++i)
{
    float f1 = *(float *)&i;
    double d = f1;
    float f2 = (float)d;
    if(f1 != f2)
        printf("**** FAILURE **** %u | %f -- 0x%08x 0x%08x\n", i, f1, f1, f2);
    if ((i % 1000000) == 0)
        printf("Iteration: %d\n", i);
}

The answer to the first question is yes, the answer to the 'in other words', however is no. If you change the test in the code to be if (!(f1 != f2)) the answer to the second question becomes yes -- it will print 'Success' for all float values.

In theory every normal single can have the exponent and mantissa padded to create a double and then remove the padding and you return to the original single.

When you go from theory to reality is when you will have problems. I dont know if you were interested in theory or implementation. If it is implementation then you can rapidly get into trouble.

IEEE is a horrible format, my understanding it was intentionally designed to be so tough that nobody could meet it and allow the market to catch up to intel (this was a while back) allowing for more competition. If that is true it failed, either way we are stuck with this dreadful spec. Something like the TI format is far superior for the real world in so many ways. I have no connection to either company or any of these formats.

Thanks to this spec there are very few if any fpus that actually meet it (in hardware or even in hardware plus the operating system), and those that do often fail on the next generation. (google: TestFloat). The problems these days tend to lie in the int to float and float to int and not single to double and double to single as you have specified above. Of course what operation is the fpu going to perform to do that conversion? Add 0? Multiply by 1? Depends on the fpu and the compiler.

The problem with IEEE related to your question above is that there is more than one way a number, not every number but many numbers can be represented. If I wanted to break your code I would start with minus zero in the hope that one of the two operations would convert it to a plus zero. Then I would try denormals. And it should fail with a signaling nan, but you called that out as a known exception.

The problem is that equal sign, here is rule number one about floating point, never use an equal sign. Equals is a bit comparison not a value comparison, if you have two values represented in different ways (plus zero and minus zero for example) the bit comparison will fail even though its the same number. Greater than and less than are done in the fpu, equals is done with the integer alu.

I realize that you probably used the equal to explain the problem and not necessarily the code you wanted to succeed or fail.

If a floating-point type is viewed as representing a precise value, then as other posters have noted, every float value is representable as a double, but only a few values of double can be represented by float. On the other hand, if one recognizes that floating-point values are approximations, one will realize the real situation is reversed. If one uses a very precise instrument to measure something which is 3.437mm, one may correctly describe is size as 3.4mm. if one uses a ruler to measure the object as 3.4mm, it would be incorrect to describe its size as 3.400mm.

Even bigger problems exist at the top of the range. There is a float value that represents: "computed value exceeded 2^127 by an unknown amount", but there's no double value that indicates such a thing. Casting an "infinity" from single to double will yield a value "computed value exceeded 2^1023 by an unknown amount" which is off by a factor of over a googol.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow