Question

I have a list of normal vectors, and I am calculating their scalar triple products and sorting by them. I have compared the sorting in three different cases:

  1. Using Matlab's sort to find the largest absolute triple-product values.
  2. Using std::sort in C++ on a std::vector, with double for the triple products.
  3. Using radix sort in OpenCL C, converting the absolute floating-point values to unsigned integers and back again, with cl_float for the triple products.

All three give different values, and with them different sorted indices, which breaks my algorithm. What is the problem here, and how can I keep the results consistent?


Solution

The problem at hand:

  • Calculate the scalar triple product of three 3-dimensional vectors, each component of each vector represented as a binary32 float.
  • Be able to tell whether one result of that calculation is greater than another.

So far so good, but if we apply the formula to the vectors directly, some bits may be lost in the operations and we will be unable to discern two results. As @rcgldr pointed out, sorting is not the problem; the precision is.

One solution to floating-point roundoff problems is to increase the number of bits, that is, use double. You said you have no double available, so let's do it ourselves: perform the whole calculation in an array of unsigned char, as long as we need it to be.
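To see the kind of bit loss involved, here is a tiny self-contained illustration (with made-up numbers, not the question's vectors): a single product rounded to binary32 can swallow a difference that a wider format still sees.

#include <cstdio>

int main(){
    // 10001 * 9999 = 99999999, which is NOT representable in binary32;
    // the nearest float is 1.0e8f, so the trailing -1 is rounded away.
    float a=10001.0f, b=9999.0f, c=100000000.0f; // all three are exact floats
    float ab=a*b;                                // rounds to 1.0e8f
    float f=ab-c;                                // 0.0f: the difference is gone
    double d=(double)a*(double)b-(double)c;      // -1.0: the wider format keeps it
    std::printf("float: %g  double: %g\n", (double)f, d);
    return 0;
}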

Okay, okay, practical considerations:

  1. The input is made of normalized vectors, so the length is no greater than one, which implies that no component is greater than one.

  2. The exponent of a binary32 float ranges from -127 (zero and denormals) to 127 (with 128 reserved for infinities and NaNs), but all components here will have an exponent between -127 and 0 (or else they would be greater than one).

  3. The maximum precision of the input is 24 bits.

  4. The scalar triple product involves a vector (cross) product and a scalar (dot) product. In the cross product (which happens first) there are subtractions of multiplication results, and in the dot product there is a sum of multiplication results.

Considerations 2 and 3 tell us that the whole family of input components fits in a fixed-point format of 127 bits for offsetting plus 24 bits for the mantissa: 151 bits, or 19 bytes. Let's make it 24.

To fully represent all possible sums and subtractions, one extra bit suffices (in case of a carry), but to fully represent all possible multiplication results we need double the number of bits. It follows that doubling the size is enough for the cross product's multiplications, and tripling it is enough for the next multiplication in the dot product.
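That sizing rule (a product of two n-bit numbers needs at most 2n bits; a sum needs n+1 bits) can be checked exhaustively at a small size. A throwaway sketch for n = 8:

#include <cstdint>
#include <cassert>

int main(){
    for(uint32_t a=0;a<256;a++){
        for(uint32_t b=0;b<256;b++){
            assert(a*b <= 0xFFFFu); // an 8-bit * 8-bit product always fits in 16 bits
            assert(a+b <= 0x1FFu);  // an 8-bit + 8-bit sum always fits in 9 bits
        }
    }
    return 0; // no assertion fires: the sizing rule holds
}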

Here is a draft of a class that loads a float into that very huge fixed-point format, keeping the sign as a bool flag (there is a helper function rollArrayRight() that I'll post separately, but hopefully its name explains what it does):

#include <cstdint>
#include <cstddef>

const size_t initialSize=24;
const size_t sizeForMult1=initialSize+initialSize;
const size_t finalSize=sizeForMult1+initialSize;

// Helper posted separately: rolls the byte array right by the given
// number of bits (negative values roll left).
void rollArrayRight(uint8_t *arr, size_t size, int32_t bits);

class imSoHuge{
public:
    bool isNegative;
    uint8_t r[finalSize];
    void load(float v){
        isNegative=false;
        for(size_t p=0;p<finalSize;p++)r[p]=0;
        union{ // note: the byte access below assumes a little-endian target
            uint8_t b[4];
            uint32_t u;
            float f;
        } reunion;
        reunion.f=v;
        if((reunion.b[3]&0x80) != 0x00)isNegative=true;
        uint32_t m, eu;
        eu=reunion.u<<1; //get rid of the sign
        eu>>=24;
        m=reunion.u&(0x007fffff);
        if(eu==0){//zero or denormal
            if(m==0)return; //zero
        }else{
            m|=(0x00800000); //implicit leading one if it's not denormal
        }
        int32_t e=(int32_t)eu-127; //exponent is now in [e], debiased
        reunion.u=m;
        r[finalSize-1]=reunion.b[3];
        r[finalSize-2]=reunion.b[2];
        r[finalSize-3]=reunion.b[1];
        r[finalSize-4]=reunion.b[0];
        rollArrayRight(r, finalSize, e-(int32_t)(sizeForMult1*8)); //correct position for fixed-point
    }
    explicit imSoHuge(float v){
        load(v);
    }
};

When the class is constructed with the number 1.0f, for example, the array r holds something like 00 00 00 00 80 00; notice that the number is loaded into the lower part of the array. The multiplications will roll the number up into the upper bytes accordingly, and we can then recover our float.

To make the class useful, we need to implement the equivalent of sum and multiplication. That is very straightforward, as long as we remember that we can only sum arrays that have been multiplied the same number of times (as in the triple product), or else their magnitudes would not match.
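For reference, here is a minimal sketch of what those two primitives could look like over raw byte arrays. These helpers are mine, for illustration, not the ones from the full source; they assume index 0 holds the least significant byte, and the sign flags (isNegative) still have to be handled around them, turning a sum into a subtraction when the signs differ.

#include <cstdint>
#include <cstddef>

// Schoolbook multiplication; out must hold na+nb bytes (zeroed here).
void mulBytes(const uint8_t *a, size_t na, const uint8_t *b, size_t nb, uint8_t *out){
    for(size_t i=0;i<na+nb;i++) out[i]=0;
    for(size_t i=0;i<na;i++){
        uint16_t carry=0;
        for(size_t j=0;j<nb;j++){
            uint16_t t=(uint16_t)(a[i]*b[j])+out[i+j]+carry; // max 255*255+255+255 = 65535, fits
            out[i+j]=(uint8_t)t;
            carry=t>>8;
        }
        out[i+nb]=(uint8_t)carry; // this slot is still untouched for the current i
    }
}

// Ripple-carry addition of two n-byte values; out must hold n+1 bytes.
void addBytes(const uint8_t *a, const uint8_t *b, size_t n, uint8_t *out){
    uint16_t carry=0;
    for(size_t i=0;i<n;i++){
        uint16_t t=(uint16_t)a[i]+b[i]+carry;
        out[i]=(uint8_t)t;
        carry=t>>8;
    }
    out[n]=(uint8_t)carry; // the one extra bit mentioned above
}

Subtraction follows the same ripple pattern, with borrows instead of carries.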


One example where such a class makes a difference:

Consider the following 3 vectors:

float a[]={0.0097905760, 0.0223784577, 0.9997016787};
float b[]={0.8248013854, 0.4413521587, 0.3534274995};
float c[]={0.4152690768, 0.3959976136, 0.8189856410};

And the following function that calculates the triple product (I hope I've got it right, haha):

float fTripleProduct(float*a, float*b, float*c){
    float crossAB[3]; // crossAB = a x b
    crossAB[0]=(a[1]*b[2])-(a[2]*b[1]);
    crossAB[1]=(a[2]*b[0])-(a[0]*b[2]);
    crossAB[2]=(a[0]*b[1])-(a[1]*b[0]);
    float tripleP=(crossAB[0]*c[0])+(crossAB[1]*c[1])+(crossAB[2]*c[2]); // (a x b) . c
    return tripleP;
}

The result of fTripleProduct(a,b,c) is 0.1336331.

If we change the last digit of the first component of a from 0 to 6, making it 0.0097905766 (which has a different hexadecimal representation), and call the function again, the result is the same, even though we know it should be greater.
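You can confirm that the two inputs really are distinct binary32 values by dumping their bit patterns; a quick check along these lines:

#include <cstdio>
#include <cstring>
#include <cstdint>

int main(){
    float v1=0.0097905760f, v2=0.0097905766f;
    uint32_t u1, u2;
    std::memcpy(&u1, &v1, sizeof u1); // memcpy avoids strict-aliasing trouble
    std::memcpy(&u2, &v2, sizeof u2);
    std::printf("%08x %08x\n", u1, u2); // two different hexadecimal representations
    return 0;
}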

Now suppose we have implemented multiplication, sum, and subtraction for the imSoHuge class, and have a function that calculates the triple product with it:

imSoHuge tripleProduct(float*a, float*b, float*c){
    // requires the default constructor and operator overloads from the full source
    imSoHuge crossAB[3];
    crossAB[0]=(imSoHuge(a[1])*imSoHuge(b[2]))-(imSoHuge(a[2])*imSoHuge(b[1]));
    crossAB[1]=(imSoHuge(a[2])*imSoHuge(b[0]))-(imSoHuge(a[0])*imSoHuge(b[2]));
    crossAB[2]=(imSoHuge(a[0])*imSoHuge(b[1]))-(imSoHuge(a[1])*imSoHuge(b[0]));
    imSoHuge tripleP=(crossAB[0]*imSoHuge(c[0]))+(crossAB[1]*imSoHuge(c[1]))+(crossAB[2]*imSoHuge(c[2]));
    return tripleP;
}

If we call that function for the two above versions of the vectors, the results in the array differ:

0  0  0  4 46 b9  4 69 39 3f 53 b8 19 e0  ... 

0  0  0  4 46 b9  4 85 93 82 df ba 7d 80  ...

And indeed they differ beyond the precision of a binary32 float, meaning that if we cast those arrays back to float we get the very same float, but if we compare the arrays, we can tell which one is greater.


To put that reasoning to the test, I've made a full working example that you can compile and run right away with -O3 -Wall -std=c++11 on GCC (or the equivalent on another compiler). It outputs:

Using class: second result is greater
casting to float:
first result: 1.336331e-001
second result: 1.336331e-001
as floats, the results are the same: 1.336331e-001

The source code is here (working fine on Ideone):

Source Code on IDEONE (C++11)

If you have not migrated to C++11, the code compiles and runs under C++98 if you define the exact-width types uint8_t, uint16_t, uint32_t, and int32_t yourself, as sketched below.
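For example, a minimal set of typedefs, assuming 8-bit chars, 16-bit shorts, and 32-bit ints on your target (verify the sizes on your platform, and watch for clashes with system headers that already define these names):

// Assumed C++98 stand-ins for <cstdint>; the sizes are target-specific assumptions.
typedef unsigned char  uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int   uint32_t;
typedef signed int     int32_t;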


How to use it?

Simply call the function tripleProduct with your inputs and compare the results using the provided overloaded comparison operators. You can also cast the class imSoHuge to float (after the triple-product calculation) using the provided overloaded cast operator.

You can hand an array of that class, together with those comparators, to any sorting algorithm; a sketch of that is shown below.
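Here the Triple struct and the function name are mine, for illustration; the sketch assumes the tripleProduct() function above and the overloaded operator< from the full source:

#include <algorithm>
#include <vector>
#include <cstddef>

struct Triple { float a[3], b[3], c[3]; }; // one candidate triple of vectors

// Rank the candidates by exact triple product, largest first.
std::vector<size_t> rankByTripleProduct(std::vector<Triple> &in){
    std::vector<imSoHuge> products;
    products.reserve(in.size());
    for(size_t i=0;i<in.size();++i) // compute each exact product once
        products.push_back(tripleProduct(in[i].a, in[i].b, in[i].c));
    std::vector<size_t> order(in.size());
    for(size_t i=0;i<order.size();++i) order[i]=i;
    std::sort(order.begin(), order.end(),
              [&](size_t i, size_t j){ return products[j] < products[i]; });
    return order;
}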


Conclusions and considerations:

Notice that one float multiplication is now performed as a multiplication of two 70+ byte long arrays; that means hundreds of times more clock cycles, plus the sums, comparisons, etc., so this will be thousands of times slower. But hey, it's exact.

The whole design of the algorithm assumes normalized vectors (there is some room here, as I don't know the precision of your normalization procedure), but it will overflow and become meaningless with most 'greater-than-one' vectors.

You can easily cap the result array to as many bytes as you wish, if keeping the whole array in memory is too much. Very few cases will produce results diverging after ~12 bytes.

I haven't stress-tested everything, such as denormals and other corner cases; there are comments in the code at the critical points.

and of course:

You can easily improve everything; I just wanted to share the reasoning =)


Source code again

Main reference:

Single-precision floating-point format (Wikipedia)
