Why do more Pentium assembly instructions take less time?
Question
Below is a clip from a listing of two Pentium assembly sequences. We have an outer loop that times our sequences and does a call-through-table to reach these routines, so the outer call is made from the same location every time. The two sequences differ in that the first has one fewer instruction than the second.
The results we get on two Intel machines are very different.
The CPUID instruction tells the Family, Model, and Stepping.
Machine 1: Family 6, Model 15, Stepping 11. CPU-Z reports "Intel Core 2 Duo E6750".
The instructions execute at statistically the same speed.
Machine 2: Family 15, Model 3, Stepping 3. CPU-Z reports "Intel Pentium 4".
The first sequence takes about 8% longer than the second sequence.
We simply cannot explain the increase in time. There should not be any difference in flag hold-off, branch prediction, register usage, etc. At least none that we can tell.
Does anyone have an idea why the first sequence would take longer to execute on the one machine?
Edit: Adding "XOR BYTE PTR ereg, 0" to the first sequence does make its timing match the second sequence on the Pentium 4. Curious.
First Sequence:
00000040 ALUSHIFT_AND_C_V_E LABEL NEAR
00000040 0F B7 04 55 MOVZX EAX, gwr[(SIZEOF WORD) * EDX] ; EAX = 0000000000000000 LLLLLLLLLLLLLLLL
00000000 E
00000048 0F B7 14 4D MOVZX EDX, gwr[(SIZEOF WORD) * ECX] ; EDX = 0000000000000000 RRRRRRRRRRRRRRRR
00000000 E
00000050 23 C2 AND EAX, EDX ; AX = L&R (result)
00000052 A3 00000000 E MOV dvalue, EAX ; Save the temporary ALU/Shifter result
00000057 C3 RET ; Return
Second Sequence:
00000060 ALUSHIFT_AND_C_V_NE LABEL NEAR
00000060 0F B7 04 55 MOVZX EAX, gwr[(SIZEOF WORD) * EDX] ; EAX = 0000000000000000 LLLLLLLLLLLLLLLL
00000000 E
00000068 0F B7 14 4D MOVZX EDX, gwr[(SIZEOF WORD) * ECX] ; EDX = 0000000000000000 RRRRRRRRRRRRRRRR
00000000 E
00000070 23 C2 AND EAX, EDX ; AX = L&R (result)
00000072 80 35 00000000 E XOR BYTE PTR ereg, 1 ; E = ~E
01
00000079 A3 00000000 E MOV dvalue, EAX ; Save the temporary ALU/Shifter result
0000007E C3 RET ; Return
Solution
After the Pentium I and II, many optimizations that compilers used to perform became less necessary. The chip decomposes the instructions into micro-ops and then optimizes for you. It could be a branch-prediction difference between the chips, or the fact that XOR + RET costs the same as a plain RET. I'm not familiar enough with the Pentium models you are looking at above to say. Another possibility is a cache-line issue or a hardware difference.
There may be something in the Intel docs, or there may not. Regardless, experienced assembly coders know that the only truth is found through testing, which is what you are doing.
OTHER TIPS
It turns out that there is some curious interaction with where the code is located that causes the increase. Even though everything is cache-aligned, switching the blocks of code caused the increase in time on the Pentium 4.
Thanks to all who took the time to investigate this or look at it.
You can add one, two, or more nops in front of this code (changing nothing else) to move where it lands in the cache and see whether there are cache effects (or just turn off the cache). One warning: as little as one extra nop can push an instruction elsewhere out of reach of its PC-relative addressing, requiring more instruction bytes. That can move the code under test further than intended and set off a chain reaction of other PC-relative instructions changing encoding.
Even if you play the cache game, the nature of the beast here is the magic inside the chip that takes one stream of instructions and divides it among the execution units.
Tweak and test is what really gets performance in the end, even if you don't understand why. But as soon as you move that code to an older chip, a newer chip, a different motherboard, or the same chip family at a different stepping, all your performance tweaks can turn on you.
A few months ago, something similar happened to me. My project has a configure switch that enables the use of __thread for thread-local variables. Without it, it falls back to pthread_getspecific and the like. The latter does every bit as much work as the __thread version, plus a function call, plus some additional instructions for setting up arguments, saving registers, and so forth. Interestingly, the more laborious version was consistently faster. Only on the Pentium 4, though. All other chips behaved sanely.