Let us break down the expression in your loop into its parts. To do so, we will rewrite the code so that every line performs exactly one assignment and one operation. Starting with the int version, it looks like the following:
// not depending on a
/* 1 */ auto t1 = j[i];
/* 2 */ auto t2 = i * t1;
// depending on decltype(a)
/* 3 */ decltype(a+t2) t3 = static_cast<decltype(a+t2)>(t2);
/* 4 */ decltype(a+t3) t4 = static_cast<decltype(a+t3)>(a);
/* 5 */ a = t3 + t4;
The first two operations do not depend on the type of a at all and will do exactly the same thing in either case. Starting with operation 3, however, there is a difference: to add a and t2, the compiler must first convert both operands to a common type. In the case where a is an int, operations 3 and 4 do nothing at all (int + int yields int, so both casts convert ints to ints). In the case where a is a double, however, t2 must be converted to a double before the addition (int + double yields double).
This means that the type of the addition in operation 5 is also different: it is either an int addition or a double addition. Ignoring the obvious aspect that a double will usually be twice as large as an int, this means that the computer needs to do something different at this point.
Implications for x64
When compiling this program for a modern x64 machine with an optimizing compiler, it should be noted that the whole program may be optimized away entirely when stated as is. Assuming this does not happen, that your compiler does not apply any illegal optimizations, and that you can live with the UB introduced by reading the uninitialized elements of j, the following could happen:
// not depending on a
MOV EAX, i // copy i to EAX register
IMUL j[i] // EAX = EAX * j[i] (high 32 bits are stored in EDX and ignored)
// if a is int
ADD a, EAX // integer addition: a += EAX
// else if a is double
CVTSI2SD XMM0, EAX // convert the multiplication result to double
ADDSD a, XMM0 // scalar double addition: a += XMM0
// endif
A good compiler would unroll the loop a bit and interleave a few of these iterations, since the loop limit is known. As you can see, there is at least a twofold increase in operations, and those instructions form a dependency chain. Also, the two instructions in the second version are slower than the single one in the first version.
While I am sure that the second version could be stated far more efficiently, it should be noted that the integer ADD of the first version is one of the fastest operations on just about any CPU and will usually be faster than its floating-point equivalent.
So, to answer your question: the CPU does indeed perform conversions between integers and floating point. They are visible in the assembly and have a (potentially significant) runtime cost.
What about single precision?
Since you asked about single precision as well, let us check what happens when using float instead:
// not depending on a
MOV EAX, i // copy i to EAX register
IMUL j[i] // EAX = EAX * j[i] (high 32 bits are stored in EDX and ignored)
// if a is float
CVTSI2SS XMM0, EAX // convert the multiplication result to float
ADDSS a, XMM0 // scalar float addition: a += XMM0
The assembly does not show a significant difference (we just exchanged the D for double with the S for single in two instructions). And, interestingly enough, the difference in performance will be slight as well (e.g. a Haswell core takes one µop more for the conversion to double than for the conversion to float, while the addition itself shows identical performance).
Verification
To verify my claims, I ran your loop 2000000 times and made sure that a was not optimized away. The results are:
int : 601.1 ms
float : 2567 ms
double: 2593 ms