Question

Can a double (of a given number of bytes, with a reasonable mantissa/exponent balance) always fully precisely hold the range of an unsigned integer of half that number of bytes?

E.g. can an eight byte double fully precisely hold the range of numbers of a four byte unsigned int?

What this will boil down to is if a two byte float can hold the range of a one byte unsigned int.

A one byte unsigned int will of course be 0 -> 255.

Was it helpful?

Solution

An IEEE754 64-bit double can represent any 32-bit integer, simply because it has 53-odd(a) bits available for precision and the 32-bit integer only needs, well, 32 :-)

It would be plausible for a (non IEEE754 double precision) 64-bit floating point number to have less than 32 bits of precision. That would allow truly huge numbers (due to the exponent) but at the cost of precision.

The bottom line is that, provided there are more bits of precision in the mantissa of the floating point number than there are in the integer (and enough bits in the exponent to scale it), then it can be represented without loss of precision.


(a) Technically, the 53rd bit of precision is an implied 1 at the start of the sequence so the amount of "variablity" may only be 52 bits. Whether it's 52 or 53, it's still enough bits to represent every 32-bit integer.

OTHER TIPS

Yes. A float (or double) is guaranteed to exactly represent any integer that does not need to be truncated. For a double, there is 53 bits of precision, so that is more than enough to exactly represent any 32 bit integer, and a tiny (statistically speaking) proportion of 64 bit ones too.

Exactly what the range is that you can represent exactly depends on a lot of factors in your implementation, but you can lower-bound it by saying that, if the exponent field is set to 0, you can exactly represent integers up to the width of your mantissa field (assuming a sign bit). For IEEE 754 double-precision, this means you can represent 52-bit numbers exactly. In general, your mantissa will be over half the width of the overall structure.

For more details on how a double works, you might want to look at this blog post: Anatomy of a floating point number.

I wouldn't use the words "fully precisely" when talking about floating-point numbers. But yes, a double can represent a 32-bit integer.

I do not know which other combinations of floats and ints that this is also true for.

Practically speaking, you don't want to bother using floating point above what your machine supports, so just switch to rational arithmetic with bignums. That way, you're guaranteed precision.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top