Packing 32bit floats into 30 bits (c++)

https://stackoverflow.com/questions/3846317

27-09-2019
|

Question

Here are the goals I'm trying to achieve:

I need to pack 32 bit IEEE floats into 30 bits.
I want to do this by decreasing the size of mantissa by 2 bits.
The operation itself should be as fast as possible.
I'm aware that some precision will be lost, and this is acceptable.
It would be an advantage, if this operation would not ruin special cases like SNaN, QNaN, infinities, etc. But I'm ready to sacrifice this over speed.

I guess this questions consists of two parts:

1) Can I just simply clear the least significant bits of mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:

float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;

2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?

Thanks in advance

Solution 4

I can't select any of the answers as the definite one, because most of them have valid information, but not quite what I was looking for. So I'll just summarize my conclusions.

The method for conversion I've posted in my question's part 1) is clearly wrong by C++ standard, so other methods to extract float's bits should be used.

And most important... as far as I understand from reading the responses and other sources about IEEE754 floats, it's ok to drop the least significant bits from mantissa. It will mostly affect only precision, with one exception: sNaN. Since sNaN is represented by exponent set to 255, and mantissa != 0, there can be situation where mantissa would be <= 3, and dropping last two bits would convert sNaN to +/-Infinity. But since sNaN are not generated during floating point operations on CPU, its safe under controlled environment.

OTHER TIPS

You actually violate the strict aliasing rules (section 3.10 of the C++ standard) with these reinterpret casts. This will probably blow up in your face when you turn on the compiler optimizations.

C++ standard, section 3.10 paragraph 15 says:

If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined

the dynamic type of the object,

a cv-qualified version of the dynamic type of the object,

a type similar to the dynamic type of the object,

a type that is the signed or unsigned type corresponding to the dynamic type of the object,

a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,

an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),

a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,

a char or unsigned char type.

Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type unsigned int. I actually got bitten myself by this. The program I wrote stopped working after turning on optimizations. Apparently, GCC didn't expect an lvalue of type float to alias an lvalue of type int which is a fair assumption by 3.10/15. The instructions got shuffled around by the optimizer under the as-if rule exploiting 3.10/15 and it stopped working.

Under the following assumptions

float really corresponds to a 32bit IEEE-float,
sizeof(float)==sizeof(int)
unsigned int has no padding bits or trap representations

you should be able to do it like this:

/// returns a 30 bit number
unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);
    return r >> 2;
}

float unpack_float(unsigned int x) {
    x <<= 2;
    float r;
    std::memcpy(&r,&x,sizeof r);
    return r;
}

This doesn't suffer from the "3.10-violation" and is typically very fast. At least GCC treats memcpy as an intrinsic function. In case you don't need the functions to work with NaNs, infinities or numbers with extremely high magnitude you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2":

unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);
    return (r+1) >> 2;
}

This works even if it changes the exponent due to a mantissa overflow because the IEEE-754 coding maps consecutive floating point values to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.

Blindly dropping the 2 LSBs of the float may fail for small number of unusual NaN encodings.

A NaN is encoded as exponent=255, mantissa!=0, but IEEE-754 doesn't say anything about which mantiassa values should be used. If the mantissa value is <= 3, you could turn a NaN into an infinity!

You should encapsulate it in a struct, so that you don't accidentally mix the usage of the tagged float with regular "unsigned int":

#include <iostream>
using namespace std;

struct TypedFloat {
    private:
        union {
            unsigned int raw : 32;
            struct {
                unsigned int num  : 30;  
                unsigned int type : 2;  
            };
        };
    public:

        TypedFloat(unsigned int type=0) : num(0), type(type) {}

        operator float() const {
            unsigned int tmp = num << 2;
            return reinterpret_cast<float&>(tmp);
        }
        void operator=(float newnum) {
            num = reinterpret_cast<int&>(newnum) >> 2;
        }
        unsigned int getType() const {
            return type;
        }
        void setType(unsigned int type) {
            this->type = type;
        }
};

int main() { 
    const unsigned int TYPE_A = 1;
    TypedFloat a(TYPE_A);

    a = 3.4;
    cout << a + 5.4 << endl;
    float b = a;
    cout << a << endl;
    cout << b << endl;
    cout << a.getType() << endl;
    return 0;
}

I can't guarantee its portability though.

How much precision do you need? If 16-bit float is enough (sufficient for some types of graphics), then ILM's 16-bit float ("half"), part of OpenEXR is great, obeys all kinds of rules (http://www.openexr.com/), and you'll have plenty of space left over after you pack it into a struct.

On the other hand, if you know the approximate range of values they're going to take, you should consider fixed point. They're more useful than most people realize.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow