Question

I'm wanting to find the number of mantissa digits and the unit round-off on a particular computer. I have an understanding of what these are, just no idea how to find them - though I understand they can vary from computer to computer.

I need this number in order to perform certain aspects of numerical analysis, like analyzing errors.

What I'm currently thinking is that I could write a small c++ program to slowly increment a number until overflow occurs, but I'm not sure what type of number to use.

Am I on the right track? How exactly does one go about calculating this?

Was it helpful?

Solution

I would think that whatever language you were using would specify how floats were stored. I know Java does this by use of a specific IEEE standard (754, I think).

If it's not specified, I would think you could just do your own check by adding 0.5 to 1 to see if the actual number changes. If it does, then add 0.25 to 1, the 0.125 to 1, and so on until the number doesn't change, something like:

float a = 1;
float b = 0.5;
int bits = 0;
while (a + b != a) {
    bits = bits + 1;
    b = b / 2;
}

If you only had 3 mantissa bits, then 1 + 1/16 would be equal to 1.

Then you've exhausted your mantissa bits.

You might actually need the base number to be 2 rather than 1, since IEEE754 uses an implied '1+' at the start.

EDIT:

It appears the method described above may have some issues as it gives 63 bits for a system that clearly has 4-byte floats.

Whether that's to do with intermediate results (I doubt it since the same code with explicit casts [while (((float)(a + b) != (float)(a))] has similar problems) or (more likely, I believe) the possibility that the unit value a can be represented with bits closer to the fractional b by adjusting the exponent, I don't yet know.

For now, it's best to rely on the language information I mentioned above such as use of IEEE754 (if that information is available).

I'll leave the problematic code in as a trap for wary players. Maybe someone with more floating point knowledge then I can leave a note explaining why it acts strangely (no conjecture, please :-).

EDIT 2:

This piece of code fixes it by ensuring intermediates are stored in floats. Turns out Jonathan Leffler was right - it was intermediate results.

#include <stdio.h>
#include <float.h>

int main(void) {
    float a = 1;
    float b = 0.5;
    float c = a + b;
    int bits = 1;
    while (c != a) {
        bits = bits + 1;
        b = b / 2;
        c = a + b;
    }
    printf("%d\n",FLT_MANT_DIG);
    printf("%d\n",bits);
    return 0;

}

This code outputs (24,24) to show that the calculated value matches the one in the header file.

Whilst written in C, it should be applicable to any language (specifically one where the information isn't available in a header or by virtue that it's specified in the language documentation). I only tested in C because Eclipse takes so long to start on my Ubuntu box :-).

OTHER TIPS

For C, and by extension C++, the information is in the <float.h> or <cfloat> headers.

For C99, the information is in section 5.2.4.2.2 of the standard:

  • FLT_RADIX
  • FLT_MANT_DIG
  • FLT_DIG
  • FLT_EPSILON
  • FLT_MIN_EXP
  • FLT_MIN
  • FLT_MIN_10_EXP
  • FLT_MAX_EXP
  • FLT_MAX
  • FLT_MAX_10_EXP

And similarly for DBL and LDBL variations on most of these (no DBL_RADIX or LDBL_RADIX). The standard suggests values appropriate for IEEE 754 (the older version of the IEEE 754 standard that was current in 1999; there was a new version published in, I believe, 2008).

You might want to check out <limits> in your C++ library:

#include <iostream>
#include <limits>
#include <typeinfo>

template <typename T>
void printDetailsFor() {
    std::cout
        << "Printing details for " << typeid(T).name() << ":\n"
        << "\tradix:        " << std::numeric_limits<T>::radix        << "\n"
        << "\tradix digits: " << std::numeric_limits<T>::digits       << "\n"
        << "\tepsilon:      " << std::numeric_limits<T>::epsilon()    << "\n"
        << std::endl;
}

int main() {
    printDetailsFor<int>();
    printDetailsFor<float>();
    printDetailsFor<double>();
    printDetailsFor<long double>();
    return 0;
}

I think that you want std::numeric_limits<T>::digits which should be one more than the number of mantissa bits. My machine prints out:

Printing details for i:
    radix:        2
    radix digits: 31
    epsilon:      0

Printing details for f:
    radix:        2
    radix digits: 24
    epsilon:      1.19209e-07

Printing details for d:
    radix:        2
    radix digits: 53
    epsilon:      2.22045e-16

Printing details for e:
    radix:        2
    radix digits: 64
    epsilon:      1.0842e-19
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top