Question

I am currently learning x86 assembly from the book Art of Intel x86 Assembly. In the part where it explains how the different instruction classes and opcodes work, it says that one-byte opcodes are encoded as iiirrmmm, where the three i bits denote the instruction class, the two r bits select one of the four main registers, and the three m bits select the operand form, possibly with an optional 2-byte memory operand, giving forms like AX, [AX], [AX+XXXX], etc. For example, 101 corresponds to [XXXX+BX] and 100 corresponds to [BX]. It also mentions early on that accessing a value in a register takes zero clock cycles, since the registers are implemented on the chip.
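To check my understanding of the layout, here is how I picture those fields being pulled out of an opcode byte (a little C sketch; the example byte is arbitrary and the field meanings are just the ones the book describes):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical decoder for the one-byte "iiirrmmm" format the book
     * describes: bits 7-5 = instruction class, bits 4-3 = register,
     * bits 2-0 = addressing mode. This only illustrates the bit layout. */
    int main(void) {
        uint8_t opcode = 0xC5;              /* example byte: 110 00 101     */
        uint8_t iii = (opcode >> 5) & 0x07; /* instruction class            */
        uint8_t rr  = (opcode >> 3) & 0x03; /* one of the 4 main registers  */
        uint8_t mmm =  opcode       & 0x07; /* addressing mode, e.g.        */
                                            /* 101 -> [XXXX+BX] per book    */
        printf("iii=%u rr=%u mmm=%u\n", iii, rr, mmm);
        return 0;
    }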

However, while explaining how long an instruction takes to execute in full, and specifically how long the CPU needs to compute the address of the memory operand, it says this:

[screenshot of the book's explanation of memory-operand address-calculation timing]

Won't it be 1 and 0 cycles respectively, since the book clearly states at the beginning that accessing a value in a register takes zero clock cycles? Why does it now say that accessing the value in BX takes 1 cycle?


Solution

It is very, very important that you understand that anything this book tells you about code execution speed is complete nonsense. The book is very old, and fifteen years is a lot of dog years in processor development. Even what is visible in your screenshot was already no longer true back then, and it is dangerously untrue today.

"Next, the CPU computes the address of the memory operand"

No, not really. Operand address calculation is the job of the AGU, the "Address Generation Unit": a separate circuit on the processor core that runs independently of the main execution units. That is why the extra work can take 0 CPU cycles; it is done concurrently. And it does not stop at the AGU: modern processors have many execution units that can complete jobs at the same time.
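To make that concrete, here is a small C sketch; the instructions mentioned in the comments are typical x86-64 compiler output, not something any particular CPU guarantees.

    /* Compilers often reuse addressing-mode arithmetic for ordinary math:
     * x * 5 is commonly compiled to something like "lea eax, [rdi + rdi*4]"
     * on x86-64, i.e. a base + index*scale calculation instead of a multiply. */
    int times5(int x) {
        return x * 5;
    }

    /* An indexed array access like a[i] typically becomes a single load with
     * a [base + index*4] addressing mode; the effective address is computed
     * by the address-generation hardware as part of executing the load. */
    int load_elem(const int *a, long i) {
        return a[i];
    }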

"We will assume that clock cycles and memory cycles are equivalent"

Not true back then, horribly untrue today. The memory bus is hundreds of times slower than the processor core. The problem is one of distance: the further an electrical signal has to travel, the harder it is to deliver it to its destination without corruption, and only going slower solves that. Modern processors with gigahertz clock rates therefore invest heavily in caches, extra memory that stores a copy of data in RAM. The L1 cache is the most important one; it typically stores 32 kilobytes of data and 32 kilobytes of instructions, sits closest to the processor core, and still takes around 3 CPU cycles to read. L2 and L3 are bigger and, inevitably, sit further away and thus take more cycles. Any program that stalls for, say, 150 CPU cycles because its data has to come from RAM is going to perform very poorly, regardless of what instructions it uses.
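If you want to see that gap for yourself, here is a rough microbenchmark sketch. It assumes a POSIX clock_gettime, the buffer sizes are guesses, and the numbers are entirely machine-dependent; it chases a chain of dependent loads first through a buffer that fits in L1, then through one far larger than any cache.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Build a random single cycle over n slots and walk it: every load
     * depends on the previous one, so the average time per step is close
     * to the raw memory/cache latency. */
    static double chase(size_t n, size_t steps) {
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 0.0;
        for (size_t i = 0; i < n; i++)
            next[i] = i;                        /* identity permutation */
        for (size_t i = n - 1; i > 0; i--) {    /* Sattolo shuffle: one big cycle */
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t s = 0; s < steps; s++)      /* dependent loads */
            p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile size_t sink = p;               /* keep the loop alive */
        (void)sink;
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        free(next);
        return ns / steps;
    }

    int main(void) {
        printf("small (fits in L1):  %.1f ns/load\n", chase(1 << 10, 10000000));
        printf("large (mostly RAM):  %.1f ns/load\n", chase(1 << 24, 10000000));
        return 0;
    }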

And that is not where the discomfort stops: the entire premise of the book is very misleading today. Modern processors don't actually execute x86 instructions directly. They have the equivalent of a just-in-time compiler, the kind used in Java or .NET: they translate x86 instructions into "micro-ops", in effect translating CISC instructions into RISC ones, the kind that are easy to execute out of order and concurrently across multiple execution sub-units. Exactly what that translation looks like is a very well-kept secret; companies like Intel and AMD treat it as intellectual property that nobody should know anything about, and above all that nobody should ever take a dependency on, since that would make it difficult for them to improve their processor designs.

A clear casualty of this innovation is that talking about an instruction taking a certain number of CPU cycles is no longer meaningful. I already pointed you to Agner Fog's manuals. They talk about latency, the time from an instruction's inputs being ready to its result being available, and throughput, which reflects how many independent instructions of the same kind can be executed at the same time. Those numbers only give you a hint at how hard a processor needs to work; they are useless for predicting the actual execution time of a program. Add the state of the caches, the speed of the memory bus, the ability of the prefetcher to guess which memory location will be needed ahead of time, and the amount of luck the branch predictor has in guessing the code flow as strong randomizers. Only a profiler can tell you how long it took.
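To see the latency/throughput distinction in code, compare these two ways of summing an array. The first is bound by one long chain of dependent additions; the second keeps four independent chains going, so several additions can be in flight at once. How much that actually helps depends on the processor and the compiler flags, so treat it as a sketch rather than a promise.

    #include <stddef.h>

    double sum1(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];                      /* one long dependency chain */
        return s;
    }

    double sum4(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {        /* four independent chains */
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)                  /* leftover elements */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }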

OTHER TIPS

Others have pointed out that the book you are reading is pretty old, so what it tells you about instruction timing isn't really relevant today. That bit about "memory cycles" and "clock cycles" being equivalent especially dates the book back to the early 80s.

How many clock cycles an instruction takes to execute depends on the data dependencies the instruction triggers, and on how much machinery the CPU designers have devoted to optimizing instruction decoding and execution.

Much older machines would use many clocks to fetch and decode complex instructions, and then often a few clocks to access memory (in the late 80s clock rates were around 10 MHz). The many-clocks-to-decode was caused by complex instruction set design, the lack of resources (die space and transistors) to throw at instruction decoding, and long delays through much larger silicon geometries. Memories had access times of around 70 ns, so a memory access required only a few clocks, because the clock cycle itself was much longer.

Because those CPUs were much simpler in implementation than modern ones, instruction execution was often controlled by a straightforward finite state machine inside the CPU. By knowing roughly how that state machine worked, you could predict instruction decoding times, including the time to decode memory addressing, as your book suggests. That is no longer true.

Modern machines have very high clock rates: 2-4 GHz. That is a few hundred times as fast. This is possible because transistors are much smaller, so electricity takes less time to cross them. Furthermore, with so many transistors available, designers can throw a ton of extra silicon at decoding, executing and caching. Oddly, memories haven't gotten a lot faster, so the relative cost of accessing memory, measured in clocks, has gone way up: 40 ns memory takes 160 clocks at 4 GHz.

Arguably, a chip could decode an instruction and store everything it needs to know about executing that instruction in a cache (and modern Intel CPUs do much of this for complex instructions). That means a complex instruction might take tens of cycles to decode the first time it is encountered, and a single clock ("look it up in the decoded-instruction cache") when encountered again, which happens frequently (consider loops). The good news is that average instruction decode times are pretty small; the bad news is that you cannot make a hard estimate of how long an instruction takes to decode and run, because it depends on how many optimizations and caches are involved, whether the caches are hit for a particular execution of the instruction, and which other instructions are still being processed and get priority for resources due to data dependencies.

As a practical matter, transistors are still in (huge but) limited supply, so the instruction decoder can't cache everything, and the designers still make tradeoffs. A key tradeoff is that "simple" (RISC-like) instructions, which tend to be executed a lot, get lots of resources so they can be decoded quickly on every encounter, while complex instructions, which tend to be executed less often, get fewer hardware resources thrown at them.

Specific to the question of how long it takes to decode a register access: it is clear that accessing a register first requires knowing which register to access, even though the register is on the chip. So after reading the instruction, it must take some time to extract the register field. If you believe the smallest unit of time on a chip is a clock, that takes more than zero and probably at least one clock. Really aggressive hardware may do multiple actions during a single clock cycle, so in fact one can design a CPU in which decoding the register field and fetching the designated register together take less than one clock. Some modern Intel processors even decode pairs of instructions, such as "compare; conditional jump", in a single step.
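For instance, the bottom of an ordinary counted loop usually compiles to exactly such a compare-and-branch pair; whether that pair actually gets fused depends on the specific instructions and the microarchitecture, so the comment below describes a typical case, not a guarantee.

    /* The loop test here typically compiles to a cmp/jne (or similar
     * compare-and-branch) pair; on many modern Intel cores such a pair is
     * macro-fused and handled as a single micro-op, which is the kind of
     * instruction pairing mentioned above. */
    int count_nonzero(const int *a, int n) {
        int count = 0;
        for (int i = 0; i < n; i++)
            if (a[i] != 0)
                count++;
        return count;
    }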

From your point of view, this means that instructions on average tend to execute pretty fast. What hurts are memory accesses to non-cached locations.

My first-order model of CPUs is that they are infinitely fast and all you have to worry about is memory access. This means I tend to trade extra computation in registers (which is fast) for cheaper memory access. Compress your data structures; it doesn't hurt :-}
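A tiny sketch of what that trade can look like; the record layout and field names below are invented purely for illustration.

    #include <stdint.h>
    #include <stdio.h>

    struct wide {                /* convenient, but cache-hungry */
        long long id;
        int       kind;
        int       flags;
        double    score;
    };

    struct packed_rec {          /* same information, far fewer bytes */
        uint32_t  id;
        uint8_t   kind;
        uint8_t   flags;
        uint16_t  score_fixed;   /* score kept as 16-bit fixed point; needs a
                                    little register work to convert, but many
                                    more records fit in each cache line */
    };

    int main(void) {
        printf("wide:   %zu bytes per record\n", sizeof(struct wide));
        printf("packed: %zu bytes per record\n", sizeof(struct packed_rec));
        return 0;
    }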

The zero cycles for register operands means the latency of a register-only operation is the baseline for the operation (the registers have a fast connection to the ALU). When a memory operand is used, memory access latency is added to the operation latency.
Part of the memory access latency is address calculation. The register containing the address (or part of it, in complicated addressing modes) must be routed to the CPU's addressing unit instead of its ALU.
The address is then used to access memory, at which point the question whether routing in the CPU took zero or one cycle becomes ridiculous: Memory latencies can be orders of magnitude greater.
Bottom line: no one cares about that cycle.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow