Domanda

I am developing a programming language, September, which uses a tagged variant type as its main value type. 3 bits are used for the type (integer, string, object, exception, etc.), and 61 bits are used for the actual value (the actual integer, pointer to the object, etc.).

Soon, it will be time to add a float type to the language. I almost have the space for a 64-bit double, so I wanted to make use of doubles for calculations internally. Since I'm actually 3 bits short for storage, I would have to round the doubles off after each calculation - essentially resulting in a 61-bit double with a mantissa or exponent shorter by 3 bits.

But! I know floating point is fraught with peril and doing things which sound sensible on paper can produce disastrous results with FP math, so I have an open-ended question to the experts out there:

Is this approach viable at all? Will I run into serious error-accumulation problems in long-running calculations by rounding at each step? Is there some specific way in which I could do the rounding in order to avoid that? Are there any special values that I won't be able to treat that way (subnormals come to mind)?

Ideally, I would like my floats to be as well-behaved as a native 61-bit double would be.

È stato utile?

Soluzione

I would recommend borrowing bits from the exponent field of the double-precision format. This is the method described in this article (that you would modify to borrow 3 bits from the exponent instead of 1). With this approach, all computations that do not use very large or very small intermediate results behave exactly as the original double-precision computation would. Even computations that run into the subnormal region of the new format behave exactly as they would if a 1+8+52 61-bit format had been standardized by IEEE.

By contrast, naively borrowing any number of bits at all from the significand introduces many double-rounding problems, all the more frequent that you are rounding from a 52-bit significand to a significand with only a few bits removed. Borrowing one bit from the significand as you suggest in an edit to your question would be the worst, with half the operations statistically producing double-rounded results that are different from what the ideal “native 61-bit double” would have produced. This means that instead of being accurate to 0.5ULP, the basic operations would be accurate to 3/4ULP, a dramatic loss of accuracy that would derail many of the existing, finely-designed numerical algorithms that expect 0.5ULP.

Three is a significant number of bits to borrow from an exponent that only has 11, though, and you could also consider using the single-precision 32-bit format in your language (calling the single-precision operations from the host).

Lastly, I give visibility here to another solution found by Jakub: borrow the three bits from the significand, and simulate round-to-odd for the intermediate double-precision computation before converting to the nearest number in 49-explicit-significand-bit, 11-exponent-bit format. If this way is chosen, it may useful to remark that the rounding itself to 49 bits of significand can be achieved with the following operations:

if ((repr & 7) == 4) 
  repr += (repr & 8) >> 1);   /* midpoint case */
else
  repr += 4;
repr &= ~(uint64_t)7; /* round to the nearest */

Despite working on the integer having the same representation as the double being considered, the above snippet works even if the number goes from normal to subnormal, from subnormal to normal, or from normal to infinite. You will of course want to set a tag in the three bits that have been freed as above. To recover a standard double-precision number from its unboxed representation, simply clear the tag with repr &= ~(uint64_t)7;.

Altri suggerimenti

This is a summary of my own research and information from the excellent answer by @Pascal Cuoq.

There are two places where we can truncate the 3-bits we need: the exponent, and the mantissa (significand). Both approaches run into problems which have to be explicitly handled in order for the calculations to behave as if we used a hypothetical native 61-bit IEEE format.

Truncating the mantissa

We shorten the mantissa by 3 bits, resulting in a 1s+11e+49m format. When we do that, performing calculations in double-precision and then rounding after each computation exposes us to double rounding problems. Fortunately, double rounding can be avoided by using a special rounding mode (round-to-odd) for the intermediate computations. There is an academic paper describing the approach and proving its correctness for all doubles - as long as we truncate at least 2 bits.

Portable implementation in C99 is straightforward. Since round-to-odd is not one of the available rounding modes, we emulate it by using fesetround(FE_TOWARD_ZERO), and then setting the last bit if the FE_INEXACT exception occurs. After computing the final double this way, we simply round to nearest for storage.

The format of the resulting float loses about 1 significant (decimal) digit compared to a full 64-bit double (from 15-17 digits to 14-16).

Truncating the exponent

We take 3 bits from the exponent, resulting in a 1s+8e+52m format. This approach (applied to a hypothetical introduction of 63-bit floats in OCaml) is described in an article. Since we reduce the range, we have to handle out-of-range exponents on both the positive side (by simply 'rounding' them to infinity) and the negative side. Doing this correctly on the negative side requires biasing the inputs to any operation in order to ensure that we get subnormals in the 64-bit computation whenever the 61-bit result needs to be subnormal. This has to be done a bit differently for each operation, since what matters is not whether the operands are subnormal, but whether we expect the result to be (in 61-bit).

The resulting format has significantly reduced range since we borrow a whopping 3 out of 11 bits of the exponent. The range goes down from 10-308...10308 to about 10-38 to 1038. Seems OK for computation, but we still lose a lot.

Comparison

Both approaches yield a well-behaved 61-bit float. I'm personally leaning towards truncating the mantissa, for three reasons:

  • the "fix-up" operations for round-to-odd are simpler, do not differ from operation to operation, and can be done after the computation
  • there is a proof of mathematical correctness of this approach
  • giving up one significant digit seems less impactful than giving up a big chunk of the double's range

Still, for some uses, truncating the exponent might be more attractive (especially if we care more about precision than range).

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top