How are double-precision floating-point numbers converted to single-precision floating-point format?

https://stackoverflow.com/questions/11772776

24-06-2021

Question

Converting numbers from double-precision floating-point format to single-precision floating-point format results in loss of precision. What's the algorithm used to achieve this conversion?

Are numbers greater than 3.4028234e+38 or less than -3.4028234e+38 simply reduced to the respective limits? I feel that the conversion process is a bit more involved than this, but I couldn't find documentation for it.

Solution

The most common floating-point formats are the binary floating-point formats specified in the IEEE 754 standard. I will answer your question for these formats. There are also decimal floating-point formats, introduced in the 2008 revision of the standard, and there are formats outside the IEEE 754 standard, but the 754 binary formats are by far the most common. Some information about rounding, and links to the standard, can be found in the Wikipedia article on IEEE 754.

Converting double precision to single precision is treated the same as rounding the result of any operation. (E.g., an addition, multiplication, or square root has an exact mathematical value, and that value is rounded according to the rules to produce the result returned from the operation. For purposes of conversion, the input value is the exact mathematical value, and it is rounded.)
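As a small illustration of this, the sketch below rounds a double to single precision and widens it back so the two values can be compared. It assumes Python on an IEEE 754 platform with the default round-to-nearest mode; `to_single` is a hypothetical helper name, not part of any library.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (a double) to single precision, then widen it back."""
    # Packing with 'f' stores a single-precision value; unpacking widens it exactly.
    return struct.unpack("<f", struct.pack("<f", x))[0]

x = 0.1
y = to_single(x)
print(x.hex())   # the exact double input
print(y.hex())   # the single-precision value it was rounded to
print(y - x)     # the rounding error introduced by the conversion
```

Note that `struct.pack` raises `OverflowError` for finite values too large for single precision rather than returning infinity, so this helper only covers in-range inputs.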

Generally, the computing environment has some default rounding mode. (Various programming languages may provide ways to change the default rounding mode or to specify it for individual operations.) The default rounding mode is commonly round-to-nearest. Others are round-toward-zero, round-toward-positive-infinity (upward), and round-toward-negative-infinity (downward).

In round-to-nearest mode, the representable number nearest the exact value is returned. If there is a tie, then the number with the even low bit (in its fraction or significand) is returned. For this purpose, infinity effectively acts as if it were the next value in the pattern of finite numbers. In single precision, the greatest finite numbers are 0x1.fffff8p127, 0x1.fffffap127, 0x1.fffffcp127, and 0x1.fffffep127. (There are 24 bits in the single-precision significand, so a step in its lowest bit is a step of 2 in the last hexadecimal digit shown.) For rounding purposes, infinity acts as if it were at 0x2p127, the next value in that sequence. The midpoint between 0x1.fffffep127 and 0x2p127 is 0x1.ffffffp127. So, if the exact result is less than 0x1.ffffffp127 (and thus closer to 0x1.fffffep127), it is rounded to 0x1.fffffep127. If it is greater than or equal to 0x1.ffffffp127, it is rounded to infinity; the tie goes to infinity because 0x1.fffffep127 has an odd low bit. The situation for negative infinity is symmetric.
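The tie rule and the overflow threshold can be checked with a few lines of Python. This is only a sketch, assuming an IEEE 754 platform in the default rounding mode; `to_single` is a hypothetical helper, and the threshold is computed arithmetically because `struct.pack` refuses to convert values that would overflow.

```python
import struct

def to_single(x: float) -> float:
    """Hypothetical helper: round a double to single precision and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Ties go to the neighbor whose low significand bit is even.
ulp = 2.0 ** -23                    # spacing of singles in [1, 2)
tie_low = 1.0 + ulp / 2             # halfway between 1 and 1 + ulp
tie_high = 1.0 + 3 * ulp / 2        # halfway between 1 + ulp and 1 + 2*ulp
print(to_single(tie_low) == 1.0)             # 1 has an even low bit
print(to_single(tie_high) == 1.0 + 2 * ulp)  # 1 + ulp is odd, so the tie goes up

# Overflow threshold: treat infinity as the next value after the largest single.
flt_max = float.fromhex("0x1.fffffep127")
step = flt_max - float.fromhex("0x1.fffffcp127")   # spacing at the top of the range
pseudo_inf = flt_max + step                         # where infinity "sits" for rounding
threshold = (flt_max + pseudo_inf) / 2              # midpoint; exact in double precision
print(pseudo_inf.hex(), threshold.hex())

# The double just below the threshold still rounds to the largest finite single.
just_below = float.fromhex("0x1.fffffefffffffp127")
print(to_single(just_below) == flt_max)
```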

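In round-toward-positive-infinity mode, the nearest representable value that is greater than or equal to the exact value is returned. So any value above 0x1.fffffep127 rounds to infinity. Round-toward-negative-infinity returns the nearest representable value that is less than or equal to the exact value, so a value above 0x1.fffffep127 rounds down to 0x1.fffffep127, and any value below -0x1.fffffep127 rounds to negative infinity. Round-toward-zero returns the nearest representable value in the direction toward zero, so it never produces infinity from a finite input: values beyond the finite range are reduced to ±0x1.fffffep127.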
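Python's standard library has no way to switch the hardware rounding mode, so the following is a pure-Python emulation of the three directed modes rather than a real mode change. It is a sketch under that assumption; the function name and the mode strings `"up"`, `"down"`, and `"zero"` are made up for illustration.

```python
import math

FLT_MAX = float.fromhex("0x1.fffffep127")   # largest finite single

def to_single_directed(x: float, mode: str) -> float:
    """Emulate a directed-rounding conversion; mode is 'up', 'down', or 'zero'."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x                              # already representable
    if abs(x) > FLT_MAX:
        # Beyond the finite range: infinity only if the mode rounds away from zero here.
        if (x > 0 and mode == "up") or (x < 0 and mode == "down"):
            return math.copysign(math.inf, x)
        return math.copysign(FLT_MAX, x)
    _, e = math.frexp(x)                      # |x| = m * 2**e with 0.5 <= m < 1
    e = max(e, -125)                          # below 2**-126 singles are subnormal
    quantum = math.ldexp(1.0, e - 24)         # spacing of singles near x
    pick = {"up": math.ceil, "down": math.floor, "zero": math.trunc}[mode]
    result = pick(x / quantum) * quantum      # both scalings are exact (powers of two)
    return math.copysign(result, x)           # keep the sign when the result is zero

big = float.fromhex("0x1.fffffe0000001p127")  # just above the largest single
for mode in ("up", "down", "zero"):
    print(mode, to_single_directed(big, mode), to_single_directed(-big, mode))
```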

The IEEE 754 standard only specifies the result; it does not specify the algorithm. The method used to achieve the rounding is up to each implementation.
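For illustration only, here is one possible way to get the round-to-nearest result in software. It is not how any particular hardware or library does it; it is a sketch that leans on exact scaling by powers of two and on Python's `round()`, which breaks ties toward the even integer. The name `round_to_single` is hypothetical.

```python
import math

FLT_MAX = float.fromhex("0x1.fffffep127")   # largest finite single

def round_to_single(x: float) -> float:
    """Round a double to the nearest single, ties to even (illustrative sketch)."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x                          # already representable
    _, e = math.frexp(x)                  # |x| = m * 2**e with 0.5 <= m < 1
    e = max(e, -125)                      # below 2**-126 singles are subnormal
    quantum = math.ldexp(1.0, e - 24)     # spacing of singles near x
    q = round(x / quantum)                # exact scaling; ties go to the even integer
    result = q * quantum                  # exact unless it overflows to infinity
    if abs(result) > FLT_MAX:
        return math.copysign(math.inf, x) # rounded past the largest finite single
    return math.copysign(result, x)       # keep the sign when the result is zero

print(round_to_single(0.1).hex())
print(round_to_single(float.fromhex("0x1.fffffefffffffp127")) == FLT_MAX)
print(round_to_single(float.fromhex("0x1.ffffffp127")))
```

Unlike a conversion through `struct.pack`, this version returns infinity on overflow instead of raising, which matches the behavior described above.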

Licensed under: CC-BY-SA with attribution
