Does IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2?

https://stackoverflow.com//questions/20029443

21-12-2019

Question

It's all in the title: do IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2?

Solution

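Yes. IEEE-754 guarantees exact representation of every integer whose magnitude needs no more significant binary digits than the significand (mantissa) provides: 24 for float, 53 for double, 113 for quad. All of -2, -1, -0, 0, 1 and 2 are far below that limit, so they are represented exactly.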
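
As a quick sanity check of that claim, here is a minimal sketch in Python. It assumes that Python's float is binary64 (true on mainstream platforms) and uses the standard struct module to push each value through binary32; the helper name survives_binary32 is made up for this example. Quad precision is not covered because the standard library has no binary128 type.

```python
import struct

def survives_binary32(x: float) -> bool:
    """True if x keeps its exact bit pattern after a trip through binary32."""
    as_float32 = struct.unpack("<f", struct.pack("<f", x))[0]
    # Compare the binary64 bytes rather than using ==, so -0.0 and 0.0 differ.
    return struct.pack("<d", as_float32) == struct.pack("<d", x)

# The six values from the question, written as binary64 literals.
values = [-2.0, -1.0, -0.0, 0.0, 1.0, 2.0]
results = {repr(v): survives_binary32(v) for v in values}
print(results)
```

A value such as 0.1 fails the same check, because binary32 has to round it differently from binary64.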

Other suggestions

IEEE 754 floating-point numbers can store integers exactly within certain ranges. For example:

binary32, implemented in C/C++ as float, provides 24 bits of precision and therefore can represent with full precision 16-bit integers, e.g. short int;
binary64, implemented in C/C++ as double, provides 53 bits of precision and can represent exactly 32-bit integers, e.g. int;
the non-standard Intel 80-bit precision, implemented as long double by some x86/x64 compilers, provides 64 significant bits and can represent 64-bit integers, e.g. long int (on LP64 systems, e.g. Unix) or long long int (on LLP64 systems, e.g. Windows);
binary128, implemented as compiler-specific types such as __float128 (GCC) or _Quad (Intel C/C++), provides 113 bits of precision and therefore can also represent 64-bit integers exactly.
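
The precision figures in the list above translate into concrete limits: with p significant bits, every integer up to 2**p in magnitude is exact, and 2**p + 1 is the first one that is not. The following sketch checks this for binary32 and binary64, assuming Python's float is binary64; to_binary32 is a hypothetical helper built on the standard struct module.

```python
import struct

def to_binary32(x: float) -> float:
    """Round x to the nearest binary32 value and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# binary32: 24 bits of precision, so all 16-bit integers fit exactly.
shorts_exact = all(to_binary32(float(i)) == i for i in range(-2**15, 2**15))
limit32_exact = to_binary32(float(2**24)) == 2**24
past32_exact = to_binary32(float(2**24 + 1)) == 2**24 + 1   # one step too far

# binary64: 53 bits of precision, so the 32-bit extremes fit exactly.
int32_exact = float(-2**31) == -2**31 and float(2**31 - 1) == 2**31 - 1
limit64_exact = float(2**53) == 2**53
past64_exact = float(2**53 + 1) == 2**53 + 1                # one step too far

print(shorts_exact, limit32_exact, past32_exact)
print(int32_exact, limit64_exact, past64_exact)
```

Python compares int and float values exactly, which is what makes the "one step too far" comparisons meaningful.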

The fact that double can hold a range of integers exceeding that of 32-bit integers is exploited by JavaScript, whose Number type has no separate integer representation and instead uses double-precision floating point to represent integers.

One quirk of floating-point numbers is that they have a separate sign bit, so both positive and negative zero exist, which is not possible in the two's complement signed integer representation.
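
A short illustration of the signed zero, again assuming Python's float is binary64: the two zeros compare equal, but the sign bit is observable through math.copysign and in the raw bytes.

```python
import math
import struct

pos, neg = 0.0, -0.0

zeros_equal = (pos == neg)            # the two zeros compare equal
sign_pos = math.copysign(1.0, pos)    # copies the sign bit of +0.0
sign_neg = math.copysign(1.0, neg)    # copies the sign bit of -0.0

# Big-endian bytes: only the top (sign) bit differs between the two zeros.
bits_pos = struct.pack(">d", pos).hex()
bits_neg = struct.pack(">d", neg).hex()

print(zeros_equal, sign_pos, sign_neg)
print(bits_pos, bits_neg)
```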

A simple way to get the answer for any decimal number: convert the absolute value to binary, rounded to the available significant bits (24 for float, 53 for double, 113 for quad), then convert back to decimal and see whether you get the same value.

For integers the answer is obvious: you don't lose anything unless the value is too big to fit into the given number of bits.
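
The round-trip test described above can be sketched with exact rational arithmetic. This is only an illustration under stated assumptions: the function names are made up, rounding is round-half-to-even, and the exponent range (overflow, subnormals) is ignored, so only the precision p matters.

```python
from fractions import Fraction

def round_to_precision(value: Fraction, p: int) -> Fraction:
    """Round value to p significant binary digits, ties to even.

    Exponent limits are ignored; only the precision is modelled.
    """
    if value == 0:
        return Fraction(0)
    sign = -1 if value < 0 else 1
    mag = abs(value)
    # k = floor(log2(mag)), estimated from bit lengths and then corrected.
    k = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** k > mag:
        k -= 1
    # Scale so that the magnitude lies in [2**(p-1), 2**p), then round.
    scale = Fraction(2) ** (k - (p - 1))
    return sign * round(mag / scale) * scale   # round() is ties-to-even here

def is_exact(value: Fraction, p: int) -> bool:
    """True if value survives rounding to p significant bits unchanged."""
    return round_to_precision(value, p) == value

for p in (24, 53, 113):   # float, double, quad
    print(p, [is_exact(Fraction(n), p) for n in (-2, -1, 0, 1, 2)],
          is_exact(Fraction(2**p + 1), p), is_exact(Fraction("0.1"), p))
```

For p = 53 the rounded value of 0.1 agrees with what Fraction(0.1) reports for a binary64 float, which is a useful cross-check of the sketch.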

Conversion of rational values with a non-integer part is more interesting. There you may lose precision when converting to binary with a fixed width, because many decimal fractions (0.1, for example) have a periodic binary expansion that has to be cut off. When converting back to decimal you then get a different value, with a long but finite decimal expansion (or you lose precision again if you round it).
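
To make this concrete, the sketch below uses Python's decimal and fractions modules to show the exact value that is stored when 0.1 is converted to binary64 (assuming Python's float is binary64). The variable names are chosen for this example only.

```python
from decimal import Decimal
from fractions import Fraction

stored = Decimal(0.1)          # exact decimal expansion of the stored double
intended = Decimal("0.1")      # the decimal value that was written

# The stored value is a fraction whose denominator is a power of two,
# which is why its decimal expansion is finite.
as_fraction = Fraction(0.1)
denominator_bits = as_fraction.denominator.bit_length() - 1

print(stored)
print(stored == intended)
print(as_fraction.denominator == 2**denominator_bits)

# A fraction that is a sum of powers of two, such as 0.5, loses nothing.
print(Decimal(0.5) == Decimal("0.5"))
```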

Since you're dabbling with IEEE floats, first read the Wikipedia page, then, when you feel ready for more, proceed to the first external link there, "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow