floating point output length in bits
21-12-2019
Question
Although it seems simple, I could not find the answer through googling. We know that floating point representation supports a larger range of numbers, and that its operations are slower than pure integer operations. We also know how the mantissa and exponent are used to represent a floating point number. My question is: on a 32-bit system, is the output of s * b^e longer than 32 bits, or exactly 32 bits? (where s = significand, b = base, e = exponent)
Solution
The exact number of bits used to represent the mantissa and exponent of floating point numbers varies from CPU to CPU, so the question is certainly architecture dependent.
There is one standard which is very dominant: IEEE Floating Point, and according to this related SO question all major CPUs you are likely to meet today implement it, including IA32.
According to Wikipedia, IEEE Floating Point requires that at least one of the following be available on any implementation:
- binary32: 24-bit significand (23 stored bits plus an implicit leading bit), 8-bit exponent, and a sign bit
- binary64: 53-bit significand (52 stored bits plus an implicit leading bit), 11-bit exponent, and a sign bit
Assuming that by 32 bit you mean the IA-32 family (which is only one of several architectures that use 32 bits), the x87 floating point registers are 80 bits wide, meaning that binary32, binary64, and a non-IEC 80-bit extended format with a 15-bit exponent can all be supported. To differentiate between 32-bit and 64-bit operands, most assemblers use size modifiers like `DWORD` and `QWORD`.
Also, besides CPU implementations, languages can require compliance to IEEE Floating Point. In the C language, for example, Annex F specifies that if the macro `__STDC_IEC_559__` is predefined by the compiler, then `float` is guaranteed to be binary32 and `double` binary64. `long double` is not required to be an IEC 60559 format, so it could then use 80 bits on an IA-32 arch.