Some compilers implement the binary128 floating-point format standardized by IEEE 754-2008. With gcc, for example, the type is __float128. That format has a precision of about 34 decimal digits (log(2^113)/log(10)).
You can also use the Boost Multiprecision library and its float128 wrapper. That implementation will either use native types, if available, or use a drop-in replacement.
Let's extend your experiment with that new non-standard type __float128, using a recent g++ (4.8):
// Compiled with g++ -Wall essai.cpp -lquadmath
#include <algorithm>
#include <iostream>
#include <iomanip>
#include <quadmath.h>
#include <sstream>

// Print a __float128 using the stream's current precision
// (capped at 190 digits so the result fits in buf).
std::ostream& operator<<(std::ostream& out, __float128 f) {
    char buf[200];
    std::ostringstream format;
    format << "%." << std::min<std::streamsize>(190, out.precision()) << "Qf";
    quadmath_snprintf(buf, sizeof buf, format.str().c_str(), f);
    out << buf;
    return out;
}

int main() {
    std::cout.precision(32);
    std::cout << "Number:    0.70710678118654752440084436210485\n";
    const float f = 0.70710678118654752440084436210485f;
    std::cout << "float:     " << std::setprecision(32) << f << std::endl;
    const double d = 0.70710678118654752440084436210485; // no f extension
    std::cout << "double:    " << std::setprecision(32) << d << std::endl;
    const double df = 0.70710678118654752440084436210485f;
    std::cout << "doublef:   " << std::setprecision(32) << df << std::endl;
    const long double ld = 0.70710678118654752440084436210485;
    std::cout << "l double:  " << std::setprecision(32) << ld << std::endl;
    const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
    std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;
    const __float128 f128 = 0.70710678118654752440084436210485;
    const __float128 f128f = 0.70710678118654752440084436210485f; // f suffix
    const __float128 f128l = 0.70710678118654752440084436210485l; // l suffix
    const __float128 f128q = 0.70710678118654752440084436210485q; // q suffix (GNU extension)
    std::cout << "f128:      " << f128 << std::endl;
    std::cout << "f f128:    " << f128f << std::endl;
    std::cout << "l f128:    " << f128l << std::endl;
    std::cout << "q f128:    " << f128q << std::endl;
}
The output is:
                   *       ** ***        ****
                   v        v v             v
Number:    0.70710678118654752440084436210485
float:     0.707106769084930419921875
double:    0.70710678118654757273731092936941
doublef:   0.707106769084930419921875
l double:  0.70710678118654757273731092936941
l doublel: 0.70710678118654752438189403651592
f128:      0.70710678118654757273731092936941
f f128:    0.70710676908493041992187500000000
l f128:    0.70710678118654752438189403651592
q f128:    0.70710678118654752440084436210485
where * is the last accurate digit of float, ** the last accurate digit of double, *** the last accurate digit of long double, and **** is the last accurate digit of __float128.
As another answer says, the C++ standard does not specify the precision of the various floating-point types (just as it does not specify the size of the integral types). It only specifies a minimum precision/size for those types. The IEEE 754 standard, however, does specify all that! The FPUs of a lot of architectures implement IEEE 754, and recent versions of gcc implement the standard's binary128 type through the __float128 extension.
As for the explanation of your code, or mine: an expression like 0.70710678118654752440084436210485f is a floating-point literal. Its type is defined by its suffix, here f for float. The value of the literal is therefore the value of that type nearest to the written number. That explains why, for example, the precision of "doublef" is the same as that of "float" in your code. Recent gcc versions provide an extension that allows floating-point literals of type __float128, using the Q suffix (quadruple precision).