x86-64 long double precision

https://stackoverflow.com/questions/2799684

04-10-2019

Question

What is the actual precision of long double on Intel 64-bit platforms? Is it 80 bits padded to 128, or an actual 128 bits?

If the former, is there another option besides GMP for achieving true 128-bit precision?

Solution

long double precision on x86-64 is the same as on regular x86. Extended double is 80 bits (10 bytes), using the x87 ISA, with 6 padding bytes added to fill a 16-byte slot. There is no 128-bit FP hardware.

A software implementation of quad or extended quad precision might benefit from the x86-64 64x64 => 128 integer multiply instruction, though.

OTHER TIPS

I would recommend using MPFR. It is a more sophisticated multiple-precision floating-point library that is built on top of GMP.

There is a good chance that long double is only 64 bits, the same as double (depending on the compiler and OS), because the compiler emits scalar SSE2 instead of x87 instructions. MSVC, for example, treats long double as a 64-bit double.

x86 doesn't support precision higher than 80 bits, but if you really need more than 64 bits for an FP algorithm, you should most likely check your numerics instead of solving the problem with brute force.

There are a few options:

Use double-double to represent quad precision. For example, see http://www.codeproject.com/Articles/884606/The-double-double-type. However, the type does not conform to the IEEE standard. You can tell by inspecting its epsilon value, which is larger (less accurate) than that of the IEEE standard 128-bit float, 1.926E-34.
Use true IEEE standard 128-bit floats. The Microsoft VC++ compiler does not provide such a type. The Intel C++ compiler does provide a type, _Quad, although its implementation is not complete (no I/O operations) at this time.
Use a third-party library. I have recently created a library called double128 that is based on Intel C++ _Quad but adds I/O operations. It works with Microsoft VC++. You can visit http://www.cg-inc.com/Product/Double128 for more information.

I recommend the Boost wrappers over MPFR or GMP:

Boost 1.70: cpp_bin_float.

As well as arbitrary types to any desired precision, the following types are provided:

cpp_bin_float_single           (24-bit mantissa, 32-bit format)
cpp_bin_float_double           (53-bit mantissa, 64-bit format)
cpp_bin_float_double_extended  (64-bit mantissa, 80-bit format)
cpp_bin_float_quad             (113-bit mantissa, 128-bit format)
cpp_bin_float_oct              (237-bit mantissa, 256-bit format)

Boost offers almost out-of-the-box functionality. Once it is compiled, all one needs to do is point the Visual Studio project at the Boost include and library directories.

Tested with Visual Studio 2017 + Boost v1.70.

See the instructions for compiling Boost.

Licensed under: CC-BY-SA with attribution
