Question

What is the difference between misses per instruction and miss penalty in computer execution time? This question is related to computer architecture.

Solution

Misses per instruction (MPI, more commonly reported as MPKI: misses per 1000 instructions) is a statistic that describes how a given workload behaves on a specific machine with a given cache hierarchy. It depends both on the access pattern (is your code accessing the same addresses over and over, or new ones all the time? What is the "reuse distance" for a recurring address?) and on the cache sizes, associativity, replacement policies, and so on. It's also specific per cache: a given workload will have different MPKI values for the L1 cache, L2, and L3 (or any other cache hierarchy you might have).
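
To make the access-pattern point concrete, here is a minimal C sketch (the array size, window size, and loop counts are arbitrary assumptions of mine): the first loop keeps reusing a small window that fits in the L1, while the second streams through memory and touches each cache line exactly once, so on most machines it will show a far higher MPKI.

```c
#include <stdlib.h>

#define N (1 << 24)   /* 16M ints (~64 MB), far larger than any cache */

/* High temporal locality: the same small window is reused constantly,
   so after the first pass almost every access hits in the L1. */
long sum_small_window(const int *a) {
    long s = 0;
    for (int rep = 0; rep < 1024; rep++)
        for (int i = 0; i < 4096; i++)   /* 16 KB working set fits in L1 */
            s += a[i];
    return s;
}

/* Streaming: every cache line is touched once and never reused,
   so the miss count is roughly one per line (high MPKI). */
long sum_streaming(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = i;
    long x = sum_small_window(a) + sum_streaming(a);
    free(a);
    return (int)(x & 1);   /* keep the compiler from optimizing the sums away */
}
```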

By the way, another related statistic is the cache hit rate (the number of hits out of total accesses). The two are connected through the number of memory operations per instruction: MPKI is the number of memory accesses per 1000 instructions multiplied by the miss rate (1 minus the hit rate).
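
Written out as a formula (the notation is mine, and the numbers in the example are invented purely for illustration):

\[
\text{MPKI} \;=\; 1000 \cdot \frac{\text{memory accesses}}{\text{instructions}} \cdot (1 - \text{hit rate})
\]

For instance, if 30% of instructions are memory operations and the hit rate is 95%, then MPKI = 1000 × 0.3 × 0.05 = 15.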

All of the above may hint at how effectively your caches behave for the given code (whether this code is experiencing memory bottlenecks due to cache behavior), but it still doesn't tell you how fast each access is. For that you should measure the miss penalty: how many cycles you have to wait for a line to be fetched from the next level of the hierarchy. This differs per system, of course, but on a modern CPU, for example, it's quite common to see a few cycles (3-5) for an access hitting the L1, ~10 for an access that missed the L1 and hit the L2, ~30 cycles for an access going out to the L3, and ~100 or even more for an access missing all cache levels and going all the way to memory.
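
Putting hit/miss rates and these per-level latencies together gives an average memory access time (AMAT). Below is a minimal C sketch using the rough latencies quoted above; the miss rates are made-up example figures, not measurements from any real system.

```c
#include <stdio.h>

/* Average memory access time (AMAT), computed recursively:
   AMAT = hit_time + miss_rate * (cost of going to the next level).
   Latencies follow the rough figures quoted above; the miss rates
   are invented example numbers. */
int main(void) {
    double l1_hit = 4.0,  l1_miss = 0.05;   /* ~3-5 cycles; 5% of accesses miss L1  */
    double l2_hit = 10.0, l2_miss = 0.30;   /* of L1 misses, 30% also miss L2       */
    double l3_hit = 30.0, l3_miss = 0.50;   /* of L2 misses, 50% also miss L3       */
    double mem    = 100.0;                  /* DRAM access, ~100 cycles             */

    double amat = l1_hit
                + l1_miss * (l2_hit
                + l2_miss * (l3_hit
                + l3_miss * mem));

    printf("AMAT = %.2f cycles per access\n", amat);   /* ~5.7 cycles here */
    return 0;
}
```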

The combination of these two stats can roughly describe how long you would spend on memory accesses in your code, but it still doesn't tell you what the runtime would be: that depends on how the accesses are interleaved (whether they are free of dependencies and can be serviced in parallel), and on how many outstanding requests your CPU supports (each level depends on different resources and limitations, such as line fill buffers, memory credits, bus bandwidth, etc.). Traversing a linked list, for example, means that all accesses are serialized, so if you have a high miss rate you'll have to accumulate the miss penalties one after another. Parallel accesses would be faster, although, as I said, they are still bounded by physical limitations; the sketch below contrasts the two cases.
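
Here is a minimal C sketch of that contrast (the node layout and names are hypothetical): in the linked-list walk, each load's address comes from the previous load, so misses serialize and their penalties add up, while in the array sum all addresses are known up front and several misses can be in flight at once.

```c
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Dependent (serialized) accesses: the address of each load comes from
   the previous load, so a miss stalls the next access; total time is
   roughly (number of misses) * (miss penalty). */
long sum_list(const struct node *n) {
    long s = 0;
    while (n) {
        s += n->value;
        n = n->next;    /* next address is unknown until this load returns */
    }
    return s;
}

/* Independent accesses: all addresses are computable up front, so the
   hardware can keep several cache misses in flight at once (limited by
   line fill buffers, bandwidth, and similar resources). */
long sum_array(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];      /* a[i+1], a[i+2], ... can be fetched in parallel */
    return s;
}
```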

Overall, these stats are extremely useful when you want to profile and optimize your code, as they can indicate where the bottlenecks are and how much your system makes you pay for them.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow