Question

The following code is slower when a is declared as a double, as opposed to an int:

double a = 0;
int j[1000];

for(int i=0; i<1000; i++){
    a += (i * j[i]);
}

Does this decrease in performance of double-precision addition arise from the compiler selecting different assembly instructions than it would if a had been declared as an int?

I am also trying to understand whether the CPU performs any "conversions" of its own between single and double precision at run time, ones that are not visible in the assembly but still cost execution time.

Solution

Let us break down the expression in your loop into its parts. To do so, we will rewrite the code so that every line is exactly one assignment and one operation. Starting with the int version, it looks like the following:

// not depending on a
/* 1 */ auto t1 = j[i];
/* 2 */ auto t2 = i * t1;

// depending on decltype(a)
/* 3 */ decltype(a+t2) t3 = static_cast<decltype(a+t2)>(t2);
/* 4 */ decltype(a+t3) t4 = static_cast<decltype(a+t3)>(a);
/* 5 */ a = t3 + t4;

The first two operations do not depend on the type of a at all, and will do the exact same thing in either case.

However, starting with operation 3, there is a difference. The reason is that, before adding a and t2, the compiler must first convert both operands to a common type. In the case where a is an int, operations 3 and 4 do nothing at all (int + int yields int, so both casts convert ints to ints). In the case where a is a double, however, t2 must be converted to a double (int + double yields double) before the addition.
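
This common-type rule can be checked directly. The following is a minimal standalone sketch (C++17, my addition rather than part of the original answer) that makes the compiler confirm the result types:

#include <type_traits>

int main() {
    int i = 0;
    double d = 0.0;
    // int + int stays int: operations 3 and 4 above are no-ops
    static_assert(std::is_same_v<decltype(i + i), int>);
    // int + double promotes the int operand to double before the addition
    static_assert(std::is_same_v<decltype(i + d), double>);
}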

This means that the addition in operation 5 is of a different type as well: it is either an int addition or a double addition. Ignoring the obvious point that a double is usually twice as large as an int, the computer therefore needs to do something different at this point.

Implications for x64

When compiling this program for a modern x64 machine with an optimizing compiler, note that the whole program may be optimized away as written. Assuming this does not happen, that your compiler does not apply any illegal optimizations, and that you can live with the UB introduced by reading uninitialized variables (the elements of j), the following could happen:

// not depending on a
MOV EAX, i // copy i to EAX register
IMUL j[i] // EAX = EAX * j[i] (high 32 bits are stored in EDX and ignored)

// if a is int
ADD a, EAX // integer addition: a += EAX

// else if a is double
CVTSI2SD XMM0, EAX // convert the multiplication result to double
ADDSD XMM0, a // scalar double addition: XMM0 += a
MOVSD a, XMM0 // store back (SSE additions cannot write directly to memory)

// endif

A good compiler would unroll the loop a bit and interleave a few of these iterations, since the loop limit is known. As you can see, there is at least a twofold increase in the number of operations, and the instructions form a dependency chain. Also, the instructions in the second version are slower than the single ADD in the first version.
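
To illustrate what "unroll and interleave" means, here is a manual sketch in C++ (my illustration, not actual compiler output). Note that a compiler may only reassociate the floating-point additions like this under flags such as -ffast-math, since floating-point addition is not associative:

// 4x unrolled version with independent partial sums
double unrolled_sum(const int j[1000]) {
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = 0; i < 1000; i += 4) {
        // four independent accumulators break the serial dependency chain,
        // so the additions can overlap in the pipeline
        a0 += static_cast<double>((i + 0) * j[i + 0]);
        a1 += static_cast<double>((i + 1) * j[i + 1]);
        a2 += static_cast<double>((i + 2) * j[i + 2]);
        a3 += static_cast<double>((i + 3) * j[i + 3]);
    }
    return (a0 + a1) + (a2 + a3);
}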

While I am sure that the second version could be written far more efficiently, it should be noted that the integer ADD of the first version is one of the fastest operations on just about any CPU and will usually be faster than its floating-point equivalent.

So, to answer your question: the CPU does indeed perform conversions between floating-point and integer values, and they are visible in the assembly and have a (potentially significant) runtime cost.

What about single precision?

Since you asked about single precision as well, let us check what happens when using float instead:

// not depending on a
MOV EAX, i // copy i to EAX register
IMUL j[i] // EAX = EAX * j[i] (high 32 bits are stored in EDX and ignored)

// if a is float
CVTSI2SS XMM0, EAX // convert the multiplication result to float
ADDSS XMM0, a // scalar float addition: XMM0 += a
MOVSS a, XMM0 // store back

The assembly does not show a significant difference (we merely exchanged the D suffixes, for double, with S suffixes, for single). And, interestingly enough, the difference in performance will be slight as well (e.g. a Haswell core takes one µop more for the conversion to double than for the conversion to float, while the addition itself performs identically).

Verification

To verify my claims, I have run your loop 2000000 times and ensured that a was not optimized away. The results are:

int   : 601.1 ms
float : 2567 ms
double: 2593 ms
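
The original answer does not show its timing harness. The following is a minimal sketch of how such a measurement could be set up (the two million repetitions match the text above; the harness itself, including the printf sink that keeps a from being optimized away, is my assumption):

#include <chrono>
#include <cstdio>

int j[1000]; // zero-initialized at namespace scope, avoiding the UB of the original

int main() {
    double a = 0; // switch this to int or float to reproduce the other rows
    const auto start = std::chrono::steady_clock::now();
    for (int run = 0; run < 2000000; ++run)
        for (int i = 0; i < 1000; ++i)
            a += i * j[i];
    const auto stop = std::chrono::steady_clock::now();
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    // printing a forces the compiler to keep the accumulation
    std::printf("a = %f, elapsed = %lld ms\n", static_cast<double>(a), static_cast<long long>(ms));
}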

Other tips

I'm ignoring that your example doesn't compile and that the array j is not initialized (I'm hoping you'll fix this).

Floating-point arithmetic is in general slower than integer arithmetic. However, your code has an even more expensive operation: converting an integer to a floating-point number. Your code suffers a double whammy (pun intended) by using mixed-mode arithmetic.
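
One way to sidestep both costs, as a sketch of my own rather than something from the original answer: accumulate in integer arithmetic and convert to double exactly once, after the loop.

double sum_without_mixed_mode(const int j[1000]) {
    long long acc = 0; // 1000 terms of at most ~2^41 each cannot overflow 64 bits
    for (int i = 0; i < 1000; ++i)
        acc += static_cast<long long>(i) * j[i]; // pure integer multiply-add
    return static_cast<double>(acc); // a single conversion, outside the loop
}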

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow