floating point output length in bits
21-12-2019
Question
Although it seems simple, I could not find the answer through googling. We know that floating point representation supports a larger range of numbers, and that its operations are slower than pure integer operations. We also know how the mantissa and exponent are used to represent a floating point number. My question is: on a 32-bit system, is the output of s * b^e longer than 32 bits, or exactly 32 bits? (where s = significand, b = base, e = exponent)
Solution
The exact number of bits used to represent the mantissa and exponent of floating point numbers varies from CPU to CPU, so the question is certainly architecture dependent.
There is one standard which is very dominant: IEEE Floating Point, and according to this related SO question all major CPUs you are likely to meet today implement it, including IA32.
According to Wikipedia, IEEE Floating Point requires that at least one of the following be available on any implementation:
- binary32: 24-bit significand (23 stored bits plus an implicit leading bit), 8-bit exponent, and a sign bit
- binary64: 53-bit significand (52 stored bits plus an implicit leading bit), 11-bit exponent, and a sign bit
Assuming that by 32 bit you mean the IA-32 family (which is only one of several architectures that use 32 bits), the x87 floating point registers are 80 bits wide, meaning that binary32, binary64, and a non-IEC 80-bit extended format with a 15-bit exponent can all be supported. To differentiate between 32-bit and 64-bit operands, most assemblers use size modifiers like `DWORD` and `QWORD`.
Also, besides CPU implementations, languages can require compliance to IEEE Floating Point. In the C language, for example, Annex F specifies that if the macro `__STDC_IEC_559__` is predefined by the compiler, then `float` is guaranteed to be binary32 and `double` binary64. `long double` is not required to be an IEC 60559 format, so it could then use 80 bits on an IA-32 arch.