Converting double to float without relying on the FPU rounding mode

https://stackoverflow.com/questions/2011959

19-09-2019

Question

Does anyone have handy snippets of code to convert an IEEE 754 double to the immediately inferior (resp. superior) float, that is, the nearest float not greater (resp. not less) than it, without changing or assuming anything about the FPU's current rounding mode?

Note: this constraint probably implies not using the FPU at all. I expect the simplest way to do it under these conditions is to read the bits of the double into a 64-bit long and work with that.

You can assume the endianness of your choice for simplicity, and that the double in question is available through the d field of the union below:

union double_bits
{
  long i;
  double d;
};
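As a side note on the union above: long is not 64 bits wide on every platform, so a fixed-width type is a safer container for the bits. The following is only a sketch of how the three IEEE 754 fields could be pulled out with integer operations; the names double_fields and split_double are hypothetical, and it assumes double is IEEE 754 binary64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE 754 binary64 value (hypothetical helper type). */
struct double_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 11 bits, biased by 1023 */
    uint64_t mantissa;   /* 52 bits, without the implicit leading 1 */
};

/* Split a double into its raw fields using integer operations only. */
static struct double_fields split_double(double d)
{
    uint64_t bits;
    struct double_fields r;

    /* Assumption: double is a 64-bit IEEE 754 binary64. */
    assert(sizeof d == sizeof bits);

    /* memcpy avoids relying on union type punning. */
    memcpy(&bits, &d, sizeof bits);

    r.sign     = (unsigned)(bits >> 63);
    r.exponent = (unsigned)((bits >> 52) & 0x7FF);
    r.mantissa = bits & 0x000FFFFFFFFFFFFFULL;
    return r;
}
```

For example, split_double(1.0) yields a sign of 0, a biased exponent of 1023 and a mantissa of 0.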

I would try to do it myself but I am certain I would introduce hard-to-notice bugs for denormalized or negative numbers.

Solution

I think the following works, but I will state my assumptions first:

Floating-point numbers are stored in IEEE-754 format on your implementation.
The conversion does not overflow.
nextafterf() is available (it is specified in C99).

Also, most likely, this method is not very efficient.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    /* Change to non-zero for superior, otherwise inferior */
    int superior = 0;

    /* double value to convert */
    double d = 0.1;

    float f;
    double tmp = d;

    if (argc > 1)
        d = strtod(argv[1], NULL);

    /* First, get an approximation of the double value */
    f = d;

    /* Now, convert that back to double */
    tmp = f;

    /* Print the numbers. %a is C99 */
    printf("Double: %.20f (%a)\n", d, d);
    printf("Float: %.20f (%a)\n", f, f);
    printf("tmp: %.20f (%a)\n", tmp, tmp);

    if (superior) {
        /* If we wanted superior, and got a smaller value,
           get the next value */
        if (tmp < d)
            f = nextafterf(f, INFINITY);
    } else {
        /* If we wanted inferior, and got a larger value,
           get the previous value */
        if (tmp > d)
            f = nextafterf(f, -INFINITY);
    }
    printf("converted: %.20f (%a)\n", f, f);

    return 0;
}

On my machine, it prints:

Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)

The idea is that I first convert the double value to a float value. Depending on the rounding mode, the result may be less than or greater than the double value. Converting it back to double is exact, so comparing it with the original value tells us which way the conversion rounded. If the float ended up on the wrong side of the original value, nextafterf() moves it one float in the requested direction.

OTHER TIPS

To do this more accurately than by simply recombining the mantissa and exponent bits, check this out:

http://www.mathworks.com/matlabcentral/fileexchange/23173

regards

I posted code to do this here: https://stackoverflow.com/q/19644895/364818 and copied it below for your convenience.

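    #include <string.h> // memcpy

    // d is IEEE double, but double is not natively supported.
    // The mantissa of in-range values is truncated (rounded toward zero).
    // Results too small for a normal float are flushed to signed zero.
    static float ConvertDoubleToFloat(void* d)
    {
        unsigned long long x;
        unsigned int bits; // assumed to be 32 bits wide
        float f; // assumed to be IEEE float
        unsigned long long sign;
        long long exponent; // signed: the rebased value can go negative
        unsigned long long mantissa;

        memcpy(&x,d,8);

        // IEEE binary64 format (unsupported)
        sign     = (x >> 63) & 1; // 1
        exponent = ((x >> 52) & 0x7FF); // 11
        mantissa = (x >> 0) & 0x000FFFFFFFFFFFFFULL; // 52

        if (exponent == 0x7FF) {
            // infinity or NaN: set a mantissa bit so a NaN stays a NaN
            bits = 0x7F800000u | (mantissa ? 0x00400000u : 0u);
        } else {
            // IEEE binary32 format (supported)
            exponent += 127 - 1023; // rebase
            mantissa >>= (52-23); // keep the top 23 bits
            if (exponent <= 0)
                bits = 0; // zero or underflow
            else if (exponent >= 0xFF)
                bits = 0x7F800000u; // overflow: infinity
            else
                bits = (unsigned int)mantissa | ((unsigned int)exponent << 23);
        }
        bits |= (unsigned int)sign << 31;
        memcpy(&f,&bits,4); // copy from a 32-bit object, so byte order does not matter

        return f;
    }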
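Since truncating the mantissa always rounds toward zero, the bit-level approach can be extended to round in a chosen direction, which is what the question asks for. The following is only a sketch, not the code from the linked post: the names double_bits_to_float_bits and double_to_float_directed are hypothetical, and it assumes IEEE binary64 and binary32 layouts. It relies on the fact that adding 1 to the magnitude bits of a finite float gives the next float away from zero, including the steps from zero to the smallest denormal and from the largest finite float to infinity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* x holds the bits of a binary64. Returns the bits of the binary32 that is
   nearest in the requested direction, using integer operations only.
   round_up == 0: largest float <= value; otherwise smallest float >= value. */
static uint32_t double_bits_to_float_bits(uint64_t x, int round_up)
{
    uint32_t sign = (uint32_t)(x >> 63);
    int      exp  = (int)((x >> 52) & 0x7FF);
    uint64_t man  = x & 0x000FFFFFFFFFFFFFULL;
    int      e    = exp - 1023 + 127;   /* exponent rebased for binary32 */
    uint32_t mag;                       /* result bits without the sign */
    int      inexact;                   /* nonzero if truncation lost bits */

    if (exp == 0x7FF) {                 /* infinity or NaN */
        mag = 0x7F800000u;
        if (man)                        /* keep it a NaN, force the quiet bit */
            mag |= 0x00400000u | (uint32_t)(man >> 29);
        return (sign << 31) | mag;
    }

    if (e >= 0xFF) {                    /* too large: truncates to FLT_MAX */
        mag = 0x7F7FFFFFu;
        inexact = 1;
    } else if (e >= 1) {                /* normal float */
        mag = ((uint32_t)e << 23) | (uint32_t)(man >> 29);
        inexact = (man & 0x1FFFFFFFULL) != 0;
    } else {                            /* float denormal, underflow or zero */
        /* Significand with the implicit bit (absent for double denormals). */
        uint64_t sig = man | (exp ? 1ULL << 52 : 0);
        int shift = 30 - e;             /* scales sig to units of 2^-149 */
        if (shift > 63) {
            mag = 0;
            inexact = sig != 0;
        } else {
            mag = (uint32_t)(sig >> shift);
            inexact = (sig & ((1ULL << shift) - 1)) != 0;
        }
    }

    /* Truncation rounded toward zero. Step one float away from zero when
       the requested direction points away from zero. */
    if (inexact && ((round_up != 0) == (sign == 0)))
        mag += 1;

    return (sign << 31) | mag;
}

/* Convenience wrapper for platforms where double and float are usable. */
static float double_to_float_directed(double d, int round_up)
{
    uint64_t x;
    uint32_t bits;
    float f;

    assert(sizeof d == 8 && sizeof f == 4);
    memcpy(&x, &d, sizeof x);
    bits = double_bits_to_float_bits(x, round_up);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For the bits of 0.1 (0x3FB999999999999A), this gives 0x3DCCCCCC when rounding down and 0x3DCCCCCD when rounding up. Only the integer function is needed on a target without native double support; the wrapper is there for testing on an ordinary machine.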

Licensed under: CC-BY-SA with attribution
