Question

I ran across the below instructions in the NASM documentation, but I can't quite make heads or tails of them. Sadly, the Intel documentation on these instructions is also somewhat lacking.

PREFETCHNTA m8                ; 0F 18 /0        [KATMAI] 
PREFETCHT0 m8                 ; 0F 18 /1        [KATMAI] 
PREFETCHT1 m8                 ; 0F 18 /2        [KATMAI] 
PREFETCHT2 m8                 ; 0F 18 /3        [KATMAI]

Could anyone possibly provide a concise example of the instructions, say to cache 256 bytes at a given address? Thanks in advance!


Solution

These instructions are hints used to suggest that the CPU try to prefetch a cache line into the cache. Because they're hints, a CPU can ignore them completely.

If the CPU does support them, it will try to prefetch, but will give up (and do nothing) if servicing the hint would involve a TLB miss. This is where most people get it wrong: e.g. they fail to do "preloading", where you insert a dummy read to force the TLB entry to be loaded so that the prefetch isn't silently dropped.
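For example, a minimal "preloading" sketch (assuming a label buf that points at data on a page that hasn't been touched yet):

mov al, [buf]                 ; dummy demand read: forces the TLB entry to be loaded
prefetcht1 [buf+64]           ; these hints are no longer dropped for a TLB miss
prefetcht1 [buf+128]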

The amount of data prefetched is 32 bytes or more, depending on the CPU. You can use CPUID to determine the actual line size (CPUID leaf 0x00000004: EBX bits 0 to 11 return the "System Coherency Line Size" minus one).
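A rough sketch of querying that (assuming a CPU new enough to support CPUID leaf 4):

mov eax, 4                    ; CPUID leaf 04h: deterministic cache parameters
xor ecx, ecx                  ; sub-leaf 0 (first cache level reported)
cpuid
and ebx, 0xFFF                ; EBX bits 11:0 = System Coherency Line Size - 1
inc ebx                       ; ebx = line size in bytes (e.g. 64)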

If you prefetch too late it doesn't help, and if you prefetch too early the data can be evicted from the cache again before it's used (which also doesn't help). There's an appendix in Intel's "IA-32 Intel Architecture Optimization Reference Manual" called "Mathematics of Prefetch Scheduling Distance" that describes how to calculate when to prefetch; you should probably read it.
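As a hedged illustration of the idea (not the manual's worked method: rsi = buffer start, rdx = buffer length, and the 256-byte look-ahead distance is just a placeholder you would tune as that appendix describes):

%define DIST 256              ; assumed prefetch-ahead distance, in bytes
xor rcx, rcx
scan_loop:
prefetcht0 [rsi+rcx+DIST]     ; hint: start fetching the data DIST bytes ahead
; ... process the 64 bytes at [rsi+rcx] here ...
add rcx, 64
cmp rcx, rdx
jb scan_loop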

Also don't forget that prefetching can decrease performance (e.g. by evicting data that is still needed in order to make room), and that if you don't prefetch anything the CPU's hardware prefetcher will probably do it for you anyway. You should probably also read about how this hardware prefetcher works (and when it doesn't). For sequential reads (e.g. memcmp()) the hardware prefetcher handles it and explicit prefetches are mostly a waste of time. Explicit prefetches are probably only worth bothering with for "random" (non-sequential) accesses that the hardware prefetcher can't/won't predict.
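For instance, in a pointer-chasing loop the hardware prefetcher can't guess the next address, but you can hint it yourself as soon as the pointer is known (a sketch with an assumed node layout: next pointer at offset 0, payload at offset 8; rdi = first node):

node_loop:
mov rax, [rdi]                ; load the pointer to the next node
test rax, rax
jz node_done
prefetcht0 [rax]              ; hint: fetch the next node while this one is processed
; ... work on the current node's payload at [rdi+8] ...
mov rdi, rax
jmp node_loop
node_done: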

Other tips

After sifting through some examples of heavily-optimized memcmp functions and the like, I've figured out how to use these instructions (somewhat) effectively.

These instructions operate on a whole cache "line" at a time (32 bytes on the Katmai-era CPUs they were introduced with, 64 bytes on most modern ones), something I missed originally. Thus, to prefetch a 256-byte buffer (assuming 32-byte lines), the following sequence could be used:

; one prefetcht1 hint per (assumed) 32-byte line of the 256-byte buffer
prefetcht1 [buffer]
prefetcht1 [buffer+32]
prefetcht1 [buffer+64]
prefetcht1 [buffer+96]
prefetcht1 [buffer+128]
prefetcht1 [buffer+160]
prefetcht1 [buffer+192]
prefetcht1 [buffer+224]

The t0 suffix instructs the CPU to prefetch it into the entire cache hierarchy.

t1 instructs that the data be fetched into the L2 cache and higher levels (it does not fill L1).

t2 continues this trend, prefetching only into L3 and higher (or, on many implementations, behaving the same as t1).

The "nta" suffix is a bit more confusing, as it tells the CPU to write the data straight to memory (ideally), as opposed to reading/writing cache lines. This can actually be quite useful in the case of incredibly large data structures, as cache pollution can be avoided and more relevant data can instead be cached.
