Slow Instructions in Simple Loop on x86

Question 1

As mentioned by Mystical, the timings you are looking at are not one to one the responsibility of the instructions it is shown against.

Modern processors run many instructions in parallel (the imul and the add 4 to eax can both run in parallel, also the math involved in the mov addressing uses the ALU too and can be computed before the imul completes).

The way most profilers compute their timing is by using timed interrupts and what you see are the instructions that happened to be the ones executed at the time of the interrupts.

To properly use a profiler, you want to run against large programs and see whether the program spends a lot of time. On a per instruction basis, it does not have much value.

If you really want to do speed tests, you want to use the CPU timer before and after your loops and see how you can ameliorate it one way or another to get it to run faster.

Question 2

I wouldn't assume that it all fits in your L1, because the code you're debugging isn't the only thing using the CPU (unless you've booted your machine to run that code, which in fact will be your operating system).

Also note that there's a pattern there: The slowest operations all are requiring main memory access. Since this access time isn't controlled by the CPU, it's difficult to point out why isn't it faster. That will require hardware analysis.

Hope this helps.

Question 3

Unfortunately, you have not given the amount of time needed for a single pass through your loop, but I assume that it's three CPU cycles. If that is true, the three instructions that happen to get time attributed to them are the three instructions which the processor is officially on when the clock ticks. The other three instructions are executed in parallel to the three officially time consuming instructions, hiding behind them.