rdtsc code that shows performance impacts from memory characteristics such as TLB misses

Question

This code is simply looping through a 2MB buffer, writing 0 to each byte of it, and computing the time it takes to perform each write, updating a low- and high-water mark (min and max) that show the shortest & longest times required for making each write.
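For reference, here is a minimal sketch of the kind of loop being described. It is not the original code; the names read_ticks, time_byte_writes and write_stats are made up for illustration, and it assumes a GCC/Clang-style compiler where __rdtsc() is available from x86intrin.h on x86 (with a clock_gettime fallback elsewhere). Note that rdtsc is not a serializing instruction, so the per-write numbers also include the overhead and reordering slack of the timer reads themselves.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the CPU time-stamp counter (units: TSC ticks). */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds (units: nanoseconds). */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Low- and high-water marks for the cost of a single write. */
struct write_stats { uint64_t min, max; };

/* Write 0 to every byte of buf, timing each write individually. */
static struct write_stats time_byte_writes(volatile unsigned char *buf, size_t n) {
    struct write_stats s = { UINT64_MAX, 0 };
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = read_ticks();
        buf[i] = 0;
        uint64_t dt = read_ticks() - t0;
        if (dt < s.min) s.min = dt;   /* fastest write seen so far */
        if (dt > s.max) s.max = dt;   /* slowest write seen so far */
    }
    return s;
}
```

A caller would malloc a 2MB buffer (2 * 1024 * 1024 bytes), pass it to time_byte_writes, and print the returned min and max.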

Assuming this program is the ONLY program running on the CPU, and assuming no asynchronous events occur while it's running (hardware interrupts or timer interrupts), this program would show you both the nominal time for making a byte-wide write to memory, and the maximum amount of time required for handling a TLB miss exception and/or page fault exception.

A TLB miss exception is an exception that the core takes when a program tries to access memory for which there is no TLB entry in the MMU. (This describes architectures with a software-managed TLB, such as MIPS. On x86, where rdtsc lives, the hardware walks the page tables itself on a TLB miss, so no exception is raised, but the access still takes measurably longer.) The MMU is the police officer at the intersection of Core Avenue and Memory Lane, who directs traffic to where it's supposed to go. OK, that's a horrible analogy. The MMU (Memory Management Unit) has two main purposes: 1) translate virtual memory accesses to the appropriate physical memory address, and 2) enforce read-only, read-write, read-execute, execute-only, etc. permissions so that a stray pointer access into a virtual memory region with conflicting attributes (or to an unmapped virtual memory region) gets trapped and raises a memory access exception (which Linux delivers to the process as SIGSEGV).

A TLB entry is a set of hardware registers in the MMU that caches the virtual-to-physical translation and the permissions of one virtual memory page (or group of pages) currently loaded into physical memory. But an MMU doesn't have an infinite number of TLB entries; it doesn't have nearly enough to describe all of the pages of memory. So if you access a legal address in your process's address space whose page has no current TLB entry, you get a TLB miss. The TLB miss handler then fetches the proper entry's data from the page tables in main memory and writes it into a TLB entry in the MMU; the MMU may even have some built-in mechanism for telling the handler which entry to replace... probably the least-recently-used entry, which is the one least likely to be needed again in the near future.
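To make the miss-and-replace behavior concrete, here is a toy software model of a small, fully associative TLB with least-recently-used replacement. It is only an illustration: the 64-entry size, the 4KB page size, and the names toy_tlb and toy_tlb_access are assumptions, and real TLBs differ in size, associativity and replacement policy.

```c
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 64   /* assumed size; real TLBs vary by CPU */
#define PAGE_SHIFT  12   /* assumed 4KB pages (2^12 bytes) */

struct toy_tlb {
    uint64_t vpn[TLB_ENTRIES];       /* virtual page number held by each entry */
    uint64_t last_used[TLB_ENTRIES]; /* logical time of last hit or fill; 0 = never used */
    int      valid[TLB_ENTRIES];     /* nonzero once the entry has been filled */
    uint64_t clock;                  /* logical time, advanced on every access */
    uint64_t misses;                 /* total number of misses so far */
};

/* Look up addr. Returns 1 on a hit, 0 on a miss.
 * On a miss, the entry with the oldest last_used time is replaced;
 * never-used entries have last_used == 0, so they are filled first.
 * The struct must start out zero-initialized. */
static int toy_tlb_access(struct toy_tlb *t, uint64_t addr) {
    uint64_t vpn = addr >> PAGE_SHIFT;
    size_t victim = 0;
    t->clock++;
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (t->valid[i] && t->vpn[i] == vpn) {
            t->last_used[i] = t->clock;
            return 1;
        }
        if (t->last_used[i] < t->last_used[victim])
            victim = i;
    }
    t->misses++;
    t->vpn[victim] = vpn;
    t->valid[victim] = 1;
    t->last_used[victim] = t->clock;
    return 0;
}
```

Walking a 2MB range one byte at a time through this model misses once per 4KB page and hits on every other access, which is the pattern the test program is hoping to observe in its timings.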

A page fault is akin to a TLB miss exception, except that in this case the content of that virtual memory page isn't even in physical memory... it may be altogether nonexistent (a newly-mapped page that has never been touched), or it may have been previously swapped out to disk to make room in the limited physical memory for another page that some program needed at some point. While TLB misses are normally pretty fast (but do affect performance nonetheless), a page fault may be a HUGE hit to performance if the page has to be pulled off of disk (even from an SSD!), since disk storage is typically several orders of magnitude slower than memory. For this reason, to keep the CPU busy working on something useful, an operating system's page fault handler often switches the currently-running process out in favor of a different process (one that's in the "ready" state) until the data arrives from disk to fill the requested virtual memory page.

Now, back to this "test code" and the efficacy of its results:

This test depends on the OS+runtime NOT pre-allocating memory pages in the call to malloc(N). I believe this is probably typical behavior; even though the runtime has reserved that much address space and knows the address range it handed out, the physical pages behind it are often not allocated by the OS until your program actually accesses (reads or writes) an address in a given page. Pages are 4KB on many platforms, but could be much larger, too: x86 processors have supported 4MB pages since the Pentium (in 32-bit mode), and x86-64 supports 2MB and 1GB pages.
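One way to check both assumptions on a POSIX system is to ask for the page size and to count the page faults taken while a freshly malloc'ed buffer is touched for the first time. This is a sketch, not a definitive measurement: minor_faults and faults_on_first_touch are hypothetical helper names, and the count you get depends on the allocator and OS (it may be close to one per page, much lower if large pages are in use, or zero if the allocator hands back memory that was already touched).

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults (those needing no disk I/O) taken by this process so far. */
static long minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return 0;
    return ru.ru_minflt;
}

/* Allocate n bytes, write one byte in each page, and return how many
 * minor faults occurred during those writes (-1 if setup failed). */
static long faults_on_first_touch(size_t n) {
    long page = sysconf(_SC_PAGESIZE);   /* base page size in bytes */
    if (page <= 0)
        return -1;
    volatile unsigned char *buf = malloc(n);
    if (buf == NULL)
        return -1;
    long before = minor_faults();
    for (size_t i = 0; i < n; i += (size_t)page)
        buf[i] = 0;                      /* first access to each page */
    long after = minor_faults();
    free((void *)buf);
    return after - before;
}
```

Calling faults_on_first_touch(2 * 1024 * 1024) and comparing the result with the buffer size divided by sysconf(_SC_PAGESIZE) gives a rough idea of whether pages are really being allocated lazily on your system.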

So assuming your platform's page size is 4KB (4096 bytes), as your program walks through the 2MB (2,097,152-byte) allocated space writing 0's to it a byte at a time, it will go through 512 of these 4KB pages. So about 2,096,640 of these writes should occur "as fast as possible" (without triggering a TLB miss or page fault exception). And roughly 512 of them, the first write into each page, will trigger TLB miss and/or page fault exceptions. So the 'min' time gives the fastest observed time to perform a write when the written address resides in an already-loaded virtual memory page and its TLB entry is currently resident in the MMU. The 'max' time gives the worst observed time to perform a write, presumably to an address that resides in a page that is not yet mapped into physical memory (and which triggered a page fault exception, and perhaps also a TLB miss exception).
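These counts depend only on the buffer size and the page size, so they are easy to recompute for other configurations. The sketch below (count_writes and walk_counts are illustrative names) assumes the buffer starts on a page boundary; a malloc'ed buffer that doesn't can straddle one extra page.

```c
#include <stddef.h>

struct walk_counts {
    size_t pages;              /* pages the buffer spans */
    size_t first_touch_writes; /* writes that are the first access to a page */
    size_t fast_writes;        /* all remaining writes */
};

/* Assumes a page-aligned buffer and one write per byte. */
static struct walk_counts count_writes(size_t buf_bytes, size_t page_bytes) {
    struct walk_counts c;
    c.pages = (buf_bytes + page_bytes - 1) / page_bytes; /* round up */
    c.first_touch_writes = c.pages;
    c.fast_writes = buf_bytes - c.pages;
    return c;
}
```

For a 2MB buffer and 4KB pages this gives 512 pages and 2,096,640 fast writes; a 4MB buffer would give 1024 pages and 4,193,280 fast writes; and a 2MB buffer backed by a single 2MB page gives 1 page and 2,097,151 fast writes.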

There are two problems with this test, if we're depending on its results to reveal some characteristics of the underlying hardware: 1) By itself, this code neglects the effect of process switching and hardware interrupts that happen for other reasons, such as time-slicing and network packets being received & processed "in the background" (which can interrupt the running process). And... 2) The 2MB test buffer is no bigger than a single large page (2MB on x86-64, 4MB on 32-bit x86), so if the OS backs it with one large page, only the very first write can take a page fault. I don't know what conditions dictate whether operating systems choose to use 4KB pages or large pages, so this may or may not be a factor on your system. Just be aware that if your min and max are on the same order of magnitude as each other, then you may well be on a system using large pages, and if your min and max differ by an order of magnitude or more, the difference may not be entirely attributable to TLB miss and page fault exceptions. Perhaps this is why the author hedged a bit in his statement that the code "may show you some performance impacts..." (emphasis added).
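One way to get more out of the same measurements is to keep a coarse histogram of the per-write times instead of only min and max. This is a suggestion rather than part of the original code, and latency_bucket, latency_hist and hist_record are illustrative names. A single stray interrupt shows up as one or two outliers, while per-page effects should show up as a cluster whose count is close to the number of pages.

```c
#include <stdint.h>

#define NUM_BUCKETS 64   /* enough for every possible 64-bit value */

/* Bucket b collects times in [2^b, 2^(b+1)); a time of 0 also lands in bucket 0. */
static unsigned latency_bucket(uint64_t dt) {
    unsigned b = 0;
    while (dt > 1) {
        dt >>= 1;
        b++;
    }
    return b;
}

struct latency_hist { uint64_t count[NUM_BUCKETS]; };

/* Record one measured write time; the struct must start out zeroed. */
static void hist_record(struct latency_hist *h, uint64_t dt) {
    h->count[latency_bucket(dt)]++;
}
```

In the timing loop you would call hist_record with each measured time, then print the non-empty buckets at the end.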