Double in place of Float and Float rounding

Question 1

The documentation for BigDecimal is silent about how floatValue() rounds. I presume it uses round-to-nearest, ties-to-even.

left and right are set to .99 and .97, respectively. When these are converted to double in round-to-nearest mode, the results are 0.9899999999999999911182158029987476766109466552734375 (in hexadecimal floating-point, 0x1.fae147ae147aep-1) and 0.9699999999999999733546474089962430298328399658203125 (0x1.f0a3d70a3d70ap-1). When those are subtracted, the result is 0.020000000000000017763568394002504646778106689453125, which clearly exceeds .02.

When .99 and .97 are converted to float, the results are 0.9900000095367431640625 (0x1.fae148p-1) and 0.9700000286102294921875 (0x1.f0a3d8p-1). When those are subtracted, the result is 0.019999980926513671875, which is clearly less than .02.
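The exact values quoted above can be reproduced with a short program. This is only a minimal sketch: the class name `ExactValues` and the helper `exact` are made up for illustration. It relies on the `BigDecimal(double)` constructor, which converts the binary value exactly, and on the fact that widening a float to a double does not change its value.

```java
import java.math.BigDecimal;

public class ExactValues {
    // Hypothetical helper: exact decimal expansion of a binary floating-point value.
    // A float argument is widened to double first, which is lossless.
    static String exact(double x) {
        return new BigDecimal(x).toPlainString();
    }

    public static void main(String[] args) {
        // Operands as double and as float, in decimal and in hexadecimal floating-point.
        System.out.println("double .99 = " + exact(0.99) + " (" + Double.toHexString(0.99) + ")");
        System.out.println("double .97 = " + exact(0.97) + " (" + Double.toHexString(0.97) + ")");
        System.out.println("float  .99 = " + exact(0.99f) + " (" + Float.toHexString(0.99f) + ")");
        System.out.println("float  .97 = " + exact(0.97f) + " (" + Float.toHexString(0.97f) + ")");

        // Differences, each computed in its own precision.
        System.out.println("double .99 - .97 = " + exact(0.99 - 0.97));
        System.out.println("float  .99 - .97 = " + exact(0.99f - 0.97f));
    }
}
```

Printing through `BigDecimal` rather than `Double.toString` matters here, because `Double.toString` prints only enough digits to identify the value uniquely, which hides which side of .02 the result falls on.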

Simply put, when a decimal numeral is converted to floating-point, the rounding may be up or down. It depends on where the number happens to lie relative to the nearest representable floating-point values. If it is not controlled or analyzed, it is practically random. Thus, sometimes you end up with a greater value than you might have expected, and sometimes you end up with a lesser value.

Using double instead of float would not guarantee that results similar to the above do not occur. It is merely happenstance that the double value in this case exceeded the exact mathematical value and the float value did not. With other numbers, it could be the other way around. For example, with double, .09 - .07 is less than .02, but, with float, .09f - .07f is greater than .02.
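To check which way each subtraction lands, the computed difference can be compared against the exact decimal .02. The sketch below assumes nothing beyond the standard library. The class name `RoundingDirection` and the helper `compareToExact` are illustrative names, not anything from the question.

```java
import java.math.BigDecimal;

public class RoundingDirection {
    // Hypothetical helper: compares the exact value of a computed result with an
    // exact decimal given as a string. Returns -1, 0, or 1 (less, equal, greater).
    static int compareToExact(double computed, String decimal) {
        return new BigDecimal(computed).compareTo(new BigDecimal(decimal));
    }

    // Turns the comparison result into a readable word.
    static String describe(int cmp) {
        return cmp < 0 ? "less than .02" : cmp > 0 ? "greater than .02" : "equal to .02";
    }

    public static void main(String[] args) {
        // Float differences are computed in float, then widened losslessly to double.
        System.out.println("double .99 - .97 is " + describe(compareToExact(0.99 - 0.97, "0.02")));
        System.out.println("float  .99 - .97 is " + describe(compareToExact(0.99f - 0.97f, "0.02")));
        System.out.println("double .09 - .07 is " + describe(compareToExact(0.09 - 0.07, "0.02")));
        System.out.println("float  .09 - .07 is " + describe(compareToExact(0.09f - 0.07f, "0.02")));
    }
}
```

The comparison uses the string constructor `new BigDecimal("0.02")` on purpose: `new BigDecimal(0.02)` would compare against the double nearest to .02 rather than against .02 itself.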

There is a lot of information about how to deal with floating-point arithmetic, such as Handbook of Floating-Point Arithmetic. It is too large a subject to cover in Stack Overflow questions. There are university courses on it.

Often on today’s typical processors, there is little extra expense for using double rather than float; simple scalar floating-point operations are performed at nearly the same speeds for double and float. Performance differences arise when you have so much data that the time to transfer them (from disk to memory or memory to processor) becomes important, or the space they occupy on disk becomes large, or your software uses SIMD features of processors. (SIMD allows processors to perform the same operation on multiple pieces of data, in parallel. Current processors typically provide about twice the bandwidth for float SIMD operations as for double SIMD operations or do not provide double SIMD operations at all.)

Question 2

Double can represent numbers with more significant digits and a greater range than float; float offers less of both. Double computations are more costly in terms of CPU, so it all depends on your application. Binary floating-point cannot exactly represent a number such as 1/5. Such numbers end up being rounded, thereby introducing errors that are certainly at the origin of your failed asserts. See http://en.m.wikipedia.org/wiki/Floating_point for more details.
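As a small illustration of the 1/5 case, the sketch below prints what is actually stored for 0.2 as a double and as a float, and how far each is from the exact decimal. The class name `OneFifth` and its helpers are hypothetical names chosen for this example.

```java
import java.math.BigDecimal;

public class OneFifth {
    // Exact decimal value of the binary number actually stored.
    static BigDecimal exact(double x) {
        return new BigDecimal(x);
    }

    // Absolute distance between the stored value and the exact decimal 0.2.
    static BigDecimal error(double x) {
        return exact(x).subtract(new BigDecimal("0.2")).abs();
    }

    public static void main(String[] args) {
        System.out.println("double 0.2 stores " + exact(0.2).toPlainString());
        System.out.println("float  0.2 stores " + exact(0.2f).toPlainString());
        System.out.println("double error: " + error(0.2).toPlainString());
        System.out.println("float  error: " + error(0.2f).toPlainString());
    }
}
```

Neither type stores 0.2 exactly. Double only makes the rounding error much smaller, it does not remove it.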

[EDIT] If all else fails run a benchmark:

package doublefloat;

/**
 *
 * @author tarik
 */
public class DoubleFloat {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // Time one billion double multiplications on a single variable
        // (d starts at 0.0, so its value never changes).
        long t1 = System.nanoTime();
        double d = 0.0;
        for (long i=0; i<1000000000;i++) {
            d = d * 1.01;
        }
        long diff1 = System.nanoTime()-t1;
        System.out.println("Double ticks: " + diff1);

        t1 = System.nanoTime();
        float f = 0.0f;
        for (long i=0; i<1000000000;i++) {
            f = f * 1.01f;
        }
        long diff2 = System.nanoTime()-t1;
        System.out.println("Float  ticks: " + diff2);
        System.out.println("Difference %: " + (diff1 - diff2) * 100.0 / diff1);    
    }
}

Output:

Double ticks: 3694029247
Float  ticks: 3355071337
Difference %: 9.175831790592209

This test was run on a PC with an Intel Core 2 Duo. Note that since we are only dealing with a single variable in a tight loop, there is no way to overwhelm the available memory bandwidth. In fact, one of the cores was consistently showing 100% CPU during each run. Conclusion: the difference is 9%, which might indeed be considered negligible.

The second test repeats the same computation but uses a relatively large amount of memory: with 70,000,000 elements, about 280 MB for the float array and 560 MB for the double array:

package doublefloat;

/**
 *
 * @author tarik
 */
public class DoubleFloat {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        final int LOOPS = 70000000;
        long t1 = System.nanoTime();
        double d[] = new double[LOOPS];
        d[0] = 1.0;
        for (int i=1; i<LOOPS;i++) {
            d[i] = d[i-1] * 1.01;
        }
        long diff1 = System.nanoTime()-t1;
        System.out.println("Double ticks: " + diff1);

        t1 = System.nanoTime();
        float f[] = new float[LOOPS];
        f[0] = 1.0f;
        for (int i=1; i<LOOPS;i++) {
            f[i] = f[i-1] * 1.01f;
        }
        long diff2 = System.nanoTime()-t1;
        System.out.println("Float  ticks: " + diff2);
        System.out.println("Difference %: " + (diff1 - diff2) * 100.0 / diff1);    
    }
}

Output:

Double ticks: 667919011
Float  ticks: 349700405
Difference %: 47.64329218950769

Memory bandwidth is overwhelmed, yet I can still see the CPU peaking at 100% for a short period of time.

Conclusion: This benchmark somewhat confirms that float takes about 9% less time than double in CPU-intensive applications and about 48% less time in data-intensive applications; put the other way, double took roughly 1.1 times as long in the first test and 1.9 times as long in the second. It also confirms Eric Postpischil's note that the CPU overhead is relatively negligible (9%) in comparison with the performance impact of limited memory bandwidth.