Implementing a half precision floating point number in C++

https://stackoverflow.com/questions/18043974

23-06-2022

Question

I am trying to implement a simple half precision floating point type, entirely for storage purposes (no arithmetic, converts to double implicitly), but I get weird behavior. I get completely wrong values for Half between -0.5 and 0.5. Also I get a nasty "offset" for values, for example 0.8 is decoded as 0.7998.

I am very new to C++, so it would be great if you could point out my mistake and help me improve the accuracy a bit. I am also curious how portable this solution is. Thanks!

Here is the output - double value and actual decoded value from the half:

-1 -1
-0.9 -0.899902
-0.8 -0.799805
-0.7 -0.699951
-0.6 -0.599854
-0.5 -0.5
-0.4 -26208
-0.3 -19656
-0.2 -13104
-0.1 -6552
-1.38778e-16 -2560
0.1 6552
0.2 13104
0.3 19656
0.4 26208
0.5 32760
0.6 0.599854
0.7 0.699951
0.8 0.799805
0.9 0.899902

Here is the code so far:

#include <stdint.h>
#include <cmath>
#include <iostream>

using namespace std;

#define EXP 4
#define SIG 11

double normalizeS(uint v) {
    return (0.5f * v / 2048 + 0.5f);
}

uint normalizeP(double v) {
    return (uint)(2048 * (v - 0.5f) / 0.5f);
}

class Half {

    struct Data {
        unsigned short sign : 1;
        unsigned short exponent : EXP;
        unsigned short significant : SIG;
    };

public:
    Half() {}
    Half(double d) { loadFromFloat(d); }

    Half & operator = (long double d) {
        loadFromFloat(d);
        return *this;
    }

    operator double() {
        long double sig = normalizeS(_d.significant);
        if (_d.sign) sig = -sig;
        return ldexp(sig, _d.exponent /*+ 1*/);
    }

private:
    void loadFromFloat(long double f) {
        long double v;
        int exp;
        v = frexp(f, &exp);
        v < 0 ? _d.sign = 1 : _d.sign = 0;
        _d.exponent = exp/* - 1*/;
        _d.significant = normalizeP(fabs(v));
    }

    Data _d;
};

int main() {

        Half a[255];

        double d = -1;

        for (int i = 0; i < 20; ++i) {
            a[i] = d;
            cout << d << " " << a[i] << endl;
            d += 0.1;
        }
}

Solution

I ended up with a very simple (naive, really) solution, capable of representing every value in the range I need: 0 up to (but not including) 64, in steps of 0.001.

Since the idea is to use it for storage, this is actually better: any in-range value with at most three decimal places can be converted to the stored form and back to double without further loss. It is also faster. It does give up some resolution (the fractional field uses only 1000 of the 1024 states its 10 bits offer) in exchange for a nicer minimum step, so it can represent any of the input values without approximation. In this case less is more: using the full 2^10 resolution for the fractional component would give a step of 1/1024, which cannot represent decimal values exactly.

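#include <cmath>  // round

class Half {
public:
    Half() {}
    Half(const double d) { load(d); }
    // Rebuild the value from the integer part and the thousandths.
    operator double() const { return _d.i + ((double)_d.f / 1000); }
private:
    struct Data {
        unsigned short i : 6;   // integer part, 0..63
        unsigned short f : 10;  // fractional part, stored in thousandths
    };
    void load(const double d) {
        int i = d;                      // truncates toward zero; d is assumed to be in [0, 64)
        _d.i = i;
        _d.f = round((d - i) * 1000);   // nearest thousandth
    }
    Data _d;
};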
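
To illustrate the point about the step size, here is a small sketch. The helper `quantizeFraction` is hypothetical and not part of the class above; it assumes a non-negative input and simply quantizes the fractional part with a chosen number of steps per unit, so the decimal step (1000) can be compared with the full binary step (1024).

```cpp
#include <cmath>

// Hypothetical helper for illustration only.
// Splits d (assumed >= 0) into integer and fractional parts, rounds the
// fractional part to the nearest of `stepsPerUnit` steps, and rebuilds
// the value the way the decoder would.
double quantizeFraction(double d, int stepsPerUnit) {
    int whole = static_cast<int>(d);                       // integer part
    long frac = std::lround((d - whole) * stepsPerUnit);   // nearest step
    return whole + static_cast<double>(frac) / stepsPerUnit;
}
```

With 1000 steps, `quantizeFraction(0.8, 1000)` stores 800 thousandths and gives back 0.8. With 1024 steps, `quantizeFraction(0.8, 1024)` stores 819 and gives back 819/1024 = 0.7998046875, which is the same 0.799805 offset that shows up in the question's output.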

OTHER TIPS

Try changing the exponent to a signed type. It worked here.

The problem is that frexp returns a negative exponent when abs(val) < 0.5, but the exponent field is unsigned, so the negative exponent is stored as a positive number. That is what makes the decoded values so large when abs(val) < 0.5.
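
A minimal sketch of that fix, assuming the same 1 + 4 + 11 bit layout as the question. The class name `HalfSketch` and the constants are made up for illustration. Besides the signed exponent, it rounds the significand to the nearest step instead of truncating; that goes beyond the tip above and should roughly halve the worst-case offset. Zero and magnitudes outside roughly 2^-9 to 2^7 still cannot be represented in this layout, and bit-field layout is implementation-defined, so the in-memory format should not be treated as portable across compilers.

```cpp
#include <cassert>
#include <cmath>

// Illustrative constants (assumed, mirroring the question's EXP and SIG).
constexpr int EXP_BITS = 4;
constexpr int SIG_BITS = 11;
constexpr double SIG_SCALE = 4096.0;  // 2^(SIG_BITS + 1): maps [0.5, 1) onto 0..2047

class HalfSketch {
    struct Data {
        unsigned short sign : 1;
        signed short exponent : EXP_BITS;       // signed: holds -8..7
        unsigned short significand : SIG_BITS;  // 0..2047
    };

public:
    HalfSketch() : _d() {}
    HalfSketch(double d) { load(d); }

    operator double() const {
        // Map the stored 0..2047 back onto the mantissa interval [0.5, 1).
        double m = 0.5 + _d.significand / SIG_SCALE;
        if (_d.sign) m = -m;
        return std::ldexp(m, _d.exponent);
    }

private:
    void load(double f) {
        assert(f != 0.0);  // zero has no encoding in this layout
        int exp = 0;
        double m = std::frexp(std::fabs(f), &exp);  // m in [0.5, 1)

        // Round to the nearest step instead of truncating.
        long q = std::lround((m - 0.5) * SIG_SCALE);  // 0..2048
        if (q == 2048) {  // mantissa rounded up to 1.0: renormalize
            q = 0;
            ++exp;
        }
        assert(exp >= -8 && exp <= 7);  // must fit the 4-bit signed field

        _d.sign = std::signbit(f) ? 1 : 0;
        _d.exponent = static_cast<short>(exp);
        _d.significand = static_cast<unsigned short>(q);
    }

    Data _d;
};
```

With this version, `HalfSketch(-0.4)` decodes to about -0.400024 instead of -26208, and `HalfSketch(0.8)` decodes to about 0.800049 instead of 0.799805. The remaining error is the quantization inherent in an 11-bit significand.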

Licensed under: CC-BY-SA with attribution
