Instruction Level Profiling: The Meaning of the Instruction Pointer?

Question 1

CPUs maintains a fiction that there are only the architectural registers (RAX, RBX, etc) and there is a specific instruction pointer (IP). Programmers and compilers target this fiction.

Yet as you noted, modern CPUs don't execute serially or in-order. Until you the programmer / user request the IP, it is like Quantum Physics, the IP is a wave of instructions being executed; all so that the processor can run the program as fast as possible. When you request the current IP (for example, via a debugger breakpoint or profiler interrupt), then the processor must recreate the fiction that you expect so it collapses this wave form (all "in flight" instructions), gathers the register values back into architectural names, and builds a context for executing the debugger routine, etc.

In this context, there is an IP that indicates the instruction where the processor should resume execution. During the out-of-order execution, this instruction was the oldest instruction yet to complete, even though at the time of the interrupt the processor was perhaps fetching instructions well past that point.

For example, perhaps the interrupt indicates mov RSI, [RBX + RCX]; as the IP, but the xor had already executed and completed; however, when the processor would resume execution after the interrupt, it will re-execute the xor.

Question 2

It's a good question, but in the kind of performance tuning I do, it doesn't matter. It doesn't really matter because what you're looking for is speed-bugs. These are things that the code is doing that take clock time and that could be done better or not at all. Examples:
- Spending I/O time looking in DLLs for resources that don't, actually, need to be looked for.
- Spending time in memory-allocation routines making and freeing objects that could simply be re-used.
- Re-calculating things in functions that could be memo-ized.
... this is just a few off the top of my head

Your biggest enemy is a self-congratulatory tendency to say "I wouldn't consciously write any bugs. Why would I?" Of course, you know that's why you test software. But the same goes for speed-bugs, and if you don't know how to find those you assume there are none, which is a way of saying "My code has no possible speedups, except maybe a profiler can show me how to shave a few cycles."

In my half-century experience, there is no code that, as first written, contains no speed-bugs. What's more, there's an enormous multiplier effect, where every speed-bug you remove makes the remaining ones more obvious. As a contrived example, suppose bug A accounts for 90% of clock time, and bug B accounts for 9%. If you only fix B, big deal - the code is 11% faster. If you only fix A, that's good - it's 10x faster. But if you fix both, that's really good - it's 100x faster. Fixing A made B big.

So the thing you need most in performance tuning is to find the speed-bugs, and not miss any. When you've done all that, then you can get down to cycle-shaving.