Easiest way to convert a decimal float to bit representation manually based on IEEE 754, without using any library

StackOverflow https://stackoverflow.com/questions/22962040

Question

I know there are number ways to read every bit of a IEEE 754 float using written libraries.

I don't want that, and I want to be able to manually convert a decimal float to binary representation based on IEEE 754.

I understand how IEEE 754 works and I am just trying to apply it.

I ask this question here just want to see whether my way is normal or stupid and I am also wondering how PC does it quickly.


If I am given a decimal float in a string, I need to figure out what the E is and what the M is.

  1. get the two parts out: integer part i and fraction part f.

  2. deal with f. I constantly multiple 2 and get the integer part (either 0 or 1) and remove the integer part and then repeat, until it becomes 0

  3. convert i to bits. This is easy I just constantly mod 2 and div 2 to get all bits of i.

for example, to convert f part

0.390625 * 2 = 0.78125 0
0.78125 * 2 = 1.5625 1
0.5625 * 2 = 1.125 1
0.125 * 2 = 0.25 0
0.25 * 2 = 0.5 0
0.5 * 2 = 1 1
0

In this case, the temparay bits of 0.390625 is 0 1 1 0 0 1.


Now, I have the bits for i and for f.

If all bits for i is 0, then on bits of f I shift_left it until the first 1 is gone, according to the default hidden 1 of M. I get M, then give the value of shifting to E, considering of E's baseline of course.

If i is not 0, then I concatenate both bits part and calculate how many shift_right I need to do to make the concatenated bits to be 1, then give this value to E


I guess all my steps are not wrong. But I feel it very troublesome.

Is there a easy and clean way?

How does PC does it?

Was it helpful?

Solution

See the files src/lib/floating_point.ml and src/lib/floating_point.mli in Frama-C. They implement the conversion from decimal representation to floating-point for single-precision and double-precision (you cannot obtain the former from the latter because of double rounding issues), without any external library. The files are covered by the LGPL 2.1. This implementation is the subject of a couple of blog posts starting at this one and continuing with this one.

This is probably close to the simplest conversion function one can make, as in writing this function, I had no performance constraints and only hoped to keep the code as simple and as correct as possible, without wanting a dependence towards an existing library such as MPFR.

...
type parsed_float = {
  f_nearest : float ;
  f_lower : float ;
  f_upper : float ;
}

val single_precision_of_string: string -> parsed_float
val double_precision_of_string: string -> parsed_float
...

OTHER TIPS

I don't understand your treatment of the fraction. As shown, you are doing decimal fraction arithmetic, which would give correct results but introduces its own implementation difficulties. Doing binary fraction arithmetic would depend on converting the fraction to a binary fraction in order to convert it to a binary fraction.

I think it might be simpler to work entirely in binary integers, though you would still need an extended form, such as BigInteger.

To do that, first note the number of digits after the decimal point, D. Convert the decimal digit string to an integer N, ignoring the decimal point. The value is N/10**D, using "**" to represent power. Calculate 10**D as a binary integer.

Calculate N/10**D by binary long division, stopping when you have F+2 significant bits in the result, where F is the number of fraction bits in your floating point format. Note the location of the binary point in this result.

The most significant one bit will not be used if the number is in the normal range. To correctly round down to F fraction bits, you need both the least significant of the F+2 bits, call it G, and also a bit S that is zero if, and only if, the remainder is zero. If G is 0, use the F fraction bits unchanged. If G and S are both one, you need to round up. If G is one and S is zero the exact result is half way between two representable values, and you should round to even.

Calculate the exponent from the position of the most significant bit relative to the binary point, after processing any carry-out due to rounding up. If the exponent is in range, you are done. If it is too bit, return infinity of the appropriate sign. If it is too small, you need to denormalize. To get the rounding right, recompute G and S from the bits you are dropping and the old value of S.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top