Analysis of float/double precision in 32 decimal digits

Question 1

From the standard:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

So you can see your issue with this question: the standard doesn't actually say how precise floats are.

In practice, most implementations follow IEEE 754, so that is the standard you need to look at. This means the other two answers, from Irineau and Davidmh, are perfectly valid approaches to the problem.

As to suffix letters to indicate type, again looking at the standard:

The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double.

So your attempt to create a long double will just have the same precision as the double literal you are assigning to it unless you use the L suffix.

I understand that some of these answers may not seem satisfactory, but there is a lot of background reading to be done on the relevant standards before you can dismiss answers. This answer is already longer than intended so I won't try and explain everything here.

And as a final note: Since the precision is not clearly defined, why not have a constant that's longer than it needs to be? Seems to make sense to always define a constant that is precise enough to always be representable regardless of type.

Question 2

Python's numerical library, NumPy, has a very convenient function, numpy.finfo, that reports the parameters of a floating-point type. Its types correspond to the C ones:

For C's float:

import numpy
print(numpy.finfo(numpy.float32))
Machine parameters for float32
---------------------------------------------------------------------
precision=  6   resolution= 1.0000000e-06
machep=   -23   eps=        1.1920929e-07
negep =   -24   epsneg=     5.9604645e-08
minexp=  -126   tiny=       1.1754944e-38
maxexp=   128   max=        3.4028235e+38
nexp  =     8   min=        -max
---------------------------------------------------------------------

For C's double:

print(numpy.finfo(numpy.float64))
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

And for C's long double (exposed as numpy.float128 on this platform):

print(numpy.finfo(numpy.float128))
Machine parameters for float128
---------------------------------------------------------------------
precision= 18   resolution= 1e-18
machep=   -63   eps=        1.08420217249e-19
negep =   -64   epsneg=     5.42101086243e-20
minexp=-16382   tiny=       3.36210314311e-4932
maxexp= 16384   max=        1.18973149536e+4932
nexp  =    15   min=        -max
---------------------------------------------------------------------

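So, not even long double will give you the 32 digits you want: it is stored in 128 bits here, but machep = -63 shows that it only carries a 64-bit significand, which is about 18 decimal digits. But do you really need them all?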

Question 3

Some compilers implement the binary128 floating-point format standardized by IEEE 754-2008. With gcc, for example, the type is __float128. That format has about 34 decimal digits of precision (log(2^113)/log(10) ≈ 34.02).

You can also use the Boost Multiprecision library and its float128 wrapper. That implementation uses the native type if it is available, or a drop-in replacement otherwise.

Let's extend your experiment with that new non-standard type __float128, with a recent g++ (4.8):

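// Compiled with g++ -Wall essai.cpp -lquadmath
// (the library goes after the source file; the q literal suffix needs the
// GNU dialect, which is g++'s default)
#include <algorithm>  // std::min
#include <iostream>
#include <iomanip>
#include <quadmath.h>
#include <sstream>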

std::ostream& operator<<(std::ostream& out, __float128 f) {
  char buf[200];
  std::ostringstream format;
  // Cap the precision so the formatted number always fits in buf.
  format << "%." << std::min<std::streamsize>(190, out.precision()) << "Qf";
  quadmath_snprintf(buf, 200, format.str().c_str(), f);
  out << buf;
  return out;
}

int main() {
  std::cout.precision(32);
  std::cout << "Number:    0.70710678118654752440084436210485\n";

  const float f = 0.70710678118654752440084436210485f;
  std::cout << "float:     " << std::setprecision(32) << f << std::endl;

  const double d = 0.70710678118654752440084436210485; // no f extension
  std::cout << "double:    " << std::setprecision(32) << d << std::endl;

  const double df = 0.70710678118654752440084436210485f;
  std::cout << "doublef:   " << std::setprecision(32) << df << std::endl;

  const long double ld = 0.70710678118654752440084436210485;
  std::cout << "l double:  " << std::setprecision(32) << ld << std::endl;

  const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
  std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;

  const __float128 f128 = 0.70710678118654752440084436210485;
  const __float128 f128f = 0.70710678118654752440084436210485f; // f suffix
  const __float128 f128l = 0.70710678118654752440084436210485l; // l suffix
  const __float128 f128q = 0.70710678118654752440084436210485q; // q suffix

  std::cout << "f128:      " << f128 << std::endl;
  std::cout << "f f128:    " << f128f << std::endl;
  std::cout << "l f128:    " << f128l << std::endl;
  std::cout << "q f128:    " << f128q << std::endl;
}

The output is:

                   *       ** ***        ****
                   v        v v             v
Number:    0.70710678118654752440084436210485
float:     0.707106769084930419921875
double:    0.70710678118654757273731092936941
doublef:   0.707106769084930419921875
l double:  0.70710678118654757273731092936941
l doublel: 0.70710678118654752438189403651592
f128:      0.70710678118654757273731092936941
f f128:    0.70710676908493041992187500000000
l f128:    0.70710678118654752438189403651592
q f128:    0.70710678118654752440084436210485

where * is the last accurate digit of float, ** the last accurate digit of double, *** the last accurate digit of long double, and **** is the last accurate digit of __float128.

As another answer said, the C++ standard does not specify the precision of the various floating-point types (just as it does not specify the size of the integral types); it only sets minimum requirements for them. The IEEE 754 standard, however, does specify all of that. The FPUs of many architectures implement IEEE 754, and recent versions of gcc implement its binary128 format through the __float128 extension.

As for the explanation of your code, or mine: an expression like 0.70710678118654752440084436210485f is a floating-point literal. Its type is determined by its suffix, here f for float, so the value of the literal is the value of that type nearest to the number written. That explains why, for example, the precision of "doublef" is the same as that of "float" in your code. Recent gcc versions also provide an extension that allows floating-point literals of type __float128, written with the q or Q suffix (for quadruple precision).