Question

I had a really hard time formulating a title for this question, and I don't think I did very well, so if anyone has a better idea, the edit button is yours.

Considering that memory operations cost 3-4 cycles in the very best case, and potentially many more, and that reading data that is "narrower" than the memory bus is sub-optimal, isn't the assembly that compilers currently generate sub-optimal too?

Register operations take significantly less time, so why doesn't the generated assembly fetch all the data an expression needs before evaluating it, then execute the expression quickly? That would reduce thread switching and free the processor to run other threads.

get data 1 - 4 cycles
perform calculation 1 - 1 cycle
get data 2 - 4 cycles
perform calculation 2 - 1 cycle
get data 3 - 4 cycles
perform calculation 3 - 1 cycle

In the end, there are 15 cycles of CPU use.

get all data sequentially - 8 cycles
perform calculation 1 - 1 cycle
perform calculation 2 - 1 cycle
perform calculation 3 - 1 cycle

11 cycles used, which is roughly a 27% improvement. Also, the CPU proper is only busy for 3 cycles, since memory is fetched by the dedicated on-chip memory controller, leaving the CPU idle for a much longer window.

I suppose the CPU could schedule other code for execution while waiting for the data in the first example as well, but the window is much shorter, and with the cycle penalty for switching context it would hardly be worth it. I think the second approach, while more register-hungry, should result in better overall CPU performance. After all, modern processors all have at least 16 registers; even the current generation of mobile ARM chips has 32. So why be so conservative? Are compilers still stuck in the days of 8-register machines?

Does this assumption hold, or is current CPU architecture simply not designed to benefit from such a mechanism? I assume that while the CPU is waiting for data it can execute other code, especially since most modern processors are out-of-order. So in the worst case you waste the same time fetching the data, but having all of it available lets the code fragment execute much faster and stall the processor for a shorter time.


Solution

CPUs don't switch threads; schedulers do.

Modern CPUs don't execute instructions one at a time in strict order. They perform speculative fetches and coalesce reads ahead of time precisely to avoid the delays you're describing.

Also, on a modern machine, if a fetch has to go all the way to RAM (a miss in the last-level cache, often called an "L2 miss" on a two-level hierarchy), the penalty is more like 200 cycles.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow