Question

Say I have two real number types. They may be floating or fixed point. How can I construct a new type whose values are at least the union of the two with the minimal number of bits?

There are 3 cases to consider:

Fixed (Qa.x) $\cup$ Fixed (Qb.y) - I think the best here is to use Qmax(a,b).max(x,y). I think this is optimal since I can't come up with anything smaller that will accurately represent the type.

Float (FaEx) $\cup$ Float (FbEy) - I think the best here is to use Fmax(a,b)Emax(x,y). Again I can't think of a more optimal solution.

I am using Q notation for representing fixed point types. I don't know how floating point types are typically represented; I'm using an analogous representation where FaEx means a bits of mantissa and x bits of exponent.

The difficult case is:

Fixed (Qa.x) $\cup$ Float (FbEy) - The best I can come up with is Qmax(a,n).max(x,m) where n is the minimal bits to represent the biggest number the float can be and m is the minimal number of bits to represent the smallest positive fraction the float can be. This seems extremely inefficient as it extends the floating point's most accurate precision to its entire range. Thus for any decent sized floating point type the resulting union type will be extremely large.

Here are some ASCII diagrams of the three cases (simplified), and why I think I'm wrong:

     0/4     1/4     2/4     3/4     4/4     5/4     6/4     7/4     8/4     9/4    10/4    11/4    12/4    13/4    14/4    15/4    16/4
      | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q1.2  |.......|.......|.......|.......|.......|.......|.......|........................................................................
U     .................................................................................................................................
Q0.3  |...|...|...|...|...|...|...|....................................................................................................
=     .................................................................................................................................
Q1.3  |...|...|...|...|...|...|...|...|...|...|...|...|...|...|........................................................................

     0/4     1/4     2/4     3/4     4/4     5/4     6/4     7/4     8/4     9/4    10/4    11/4    12/4    13/4    14/4    15/4    16/4
      | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
F1E2  ........|...|...|.......|.......|...............|...............|...............................|................................
U     .................................................................................................................................
F2E1  ................|...|...|...|...|.......|.......|.......|........................................................................
=     .................................................................................................................................
F2E2  ................................|.......|.......|.......|.......|...............|...............|...............|................


     0/4     1/4     2/4     3/4     4/4     5/4     6/4     7/4     8/4     9/4    10/4    11/4    12/4    13/4    14/4    15/4    16/4
      | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q0.3  |...|...|...|...|...|...|...|....................................................................................................
U     .................................................................................................................................
F1E2  ........|...|...|.......|.......|...............|...............|...............................|................................
=     .................................................................................................................................
Q2.3  |...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|........
F?E?  |...|...|...|...|...|...|...|...|.......|.......|.......|.......|...............................|...............................|

From my math the best I could do would be Q2.3, but it is fairly obvious that there should exist some floating point type that stops having the necessary accuracy once the floating point part's accuracy is no longer needed. Of course I have to be careful if the fixed point type is more accurate than even the most accurate range of the floating point type, but I still feel like I'm missing a nice solution.

Any idea what binary type will be the smallest superset of the union between a fixed and floating point type?

NOTE: I know that this also emphasizes the benefits and drawbacks of fixed and floating types, but I feel like it should be possible to do at least a little bit better. Especially in the situation where the types have known range boundaries.

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top