Why do more Pentium assembly instructions take less time?
Question
Below is a clip from a listing of two Pentium assembly sequences. We have an outer loop that times our sequences and does a call-through-table to reach these routines, so the outer call is made from the same location every time. The two sequences differ in that the first has one fewer instruction than the second.
The results we get on two Intel machines are very different.
The CPUID instruction tells the Family, Model, and Stepping.
Machine 1: Family 6, Model 15, Stepping 11. CPU-Z reports "Intel Core 2 Duo E6750".
The instructions execute at statistically the same speed.
Machine 2: Family 15, Model 3, Stepping 3. CPU-Z reports "Intel Pentium 4".
The first sequence takes about 8% longer than the second sequence.
We simply cannot explain the increase in time. There should not be any difference in flag hold-off, branch prediction, register usage, etc. At least none that we can tell.
Does anyone have an idea why the first sequence would take longer to execute on the one machine?
Edit: Adding "XOR BYTE PTR ereg, 0" to the first sequence does make its timing match the second sequence on the Pentium 4. Curious.
First Sequence:
00000040 ALUSHIFT_AND_C_V_E LABEL NEAR
00000040 0F B7 04 55 MOVZX EAX, gwr[(SIZEOF WORD) * EDX] ; EAX = 0000000000000000 LLLLLLLLLLLLLLLL
00000000 E
00000048 0F B7 14 4D MOVZX EDX, gwr[(SIZEOF WORD) * ECX] ; EDX = 0000000000000000 RRRRRRRRRRRRRRRR
00000000 E
00000050 23 C2 AND EAX, EDX ; AX = L&R (result)
00000052 A3 00000000 E MOV dvalue, EAX ; Save the temporary ALU/Shifter result
00000057 C3 RET ; Return
Second Sequence:
00000060 ALUSHIFT_AND_C_V_NE LABEL NEAR
00000060 0F B7 04 55 MOVZX EAX, gwr[(SIZEOF WORD) * EDX] ; EAX = 0000000000000000 LLLLLLLLLLLLLLLL
00000000 E
00000068 0F B7 14 4D MOVZX EDX, gwr[(SIZEOF WORD) * ECX] ; EDX = 0000000000000000 RRRRRRRRRRRRRRRR
00000000 E
00000070 23 C2 AND EAX, EDX ; AX = L&R (result)
00000072 80 35 00000000 E XOR BYTE PTR ereg, 1 ; E = ~E
01
00000079 A3 00000000 E MOV dvalue, EAX ; Save the temporary ALU/Shifter result
0000007E C3 RET ; Return
Solution
After the Pentium I and II, many optimizations that compilers used to perform became less necessary. The chip decomposes the instructions into micro-ops and then optimizes for you. It could be a branch-prediction difference between the chips, or the fact that XOR + RET costs the same as a plain RET. I'm not familiar enough with the Pentium models you are looking at above to say. Another possibility is a cache-line issue or a hardware difference.
There may be something in the Intel docs, or there may not. Regardless, experienced assembly coders know that the only truth is found through testing, which is what you are doing.
OTHER TIPS
It turns out that there is some curious interaction with where the code is located that causes the increase. Even though everything is cache-aligned, switching the blocks of code caused the increase in time on the Pentium 4.
Thanks to all who took the time to investigate this or look at it.
You can add one, two, or more nops in front of this code (changing nothing else) to move where it lands in the cache and see whether there are cache effects (or just turn off the cache). One warning: as little as one extra nop can push an instruction elsewhere out of reach of its PC-relative addressing, requiring more instruction bytes. That can move the code under test further than intended and set off a chain reaction of other PC-relative instructions changing encoding.
Even if you play the cache game, the nature of the beast here is the magic inside the chip that takes one stream of instructions and divides it among the execution units.
Tweak and test is what really gets performance in the end, even if you don't understand why. But as soon as you move that code to an older chip, a newer chip, a different motherboard, or the same chip family at a different stepping, all your performance tweaks can turn on you.
A few months ago, something similar happened to me. My project has a configure switch that enables the use of __thread for thread-local variables. Without it, it falls back to pthread_getspecific and the like. The latter does every bit as much work as the __thread version, plus a function call, plus some additional instructions for setting up arguments, saving registers, and so forth. Interestingly, the more laborious version was consistently faster. Only on the Pentium 4, though. All other chips behaved sanely.