Question

I am studying the last few Intel microarchitectures (Nehalem/SB/IB and Haswell). I am trying to work out, at a fairly simplified level, what happens when a data request is made. So far I have this rough idea:

  1. Execution engine makes data request
  2. "Memory control" queries the L1 DTLB
  3. If the above misses, the L2 TLB is now queried

At this point two things can happen, a miss or a hit:

  1. If it's a hit, the CPU tries the L1D/L2/L3 caches, the page table, and then main memory/hard disk, in that order?

  2. If it's a miss, the CPU asks the (integrated memory controller?) to check the page table held in RAM (did I get the role of the IMC correct there?).

If somebody could edit/provide a set of bullet points giving a basic "overview" of what the CPU does from the execution engine's data request onward, including the

  • L1 DTLB (data TLB)
  • L2 TLB (data + instruction TLB)
  • L1D Cache (data cache)
  • L2 cache (data + instruction cache)
  • L3 cache (data + instruction cache)
  • The part of the CPU which controls access to main memory
  • Page table

it would be most appreciated. I did find some useful images, but they didn't really separate the interaction between the TLBs and the caches.

UPDATE: I have changed the above as I think I now understand. The TLB just gets the physical address from the virtual one. If there's a miss, we're in trouble and need to check the page table. If there's a hit, we just proceed down through the memory hierarchy, starting with the L1D cache.
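To make that concrete, here is a toy model of the translation step as I understand it (the page size, mapping, and addresses are made up purely for illustration):

```python
# Toy model: the TLB maps virtual page numbers to physical frame numbers;
# the page offset passes through unchanged. All values are illustrative.
PAGE_SIZE = 4096  # 4 KiB pages, so the low 12 bits are the offset

tlb = {0x00400: 0x1A2B3}  # virtual page 0x00400 -> physical frame 0x1A2B3

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                      # TLB hit: proceed to the L1D lookup
        return tlb[vpn] * PAGE_SIZE + offset
    return None                         # TLB miss: a page walk is needed

print(hex(translate(0x00400123)))  # 0x1a2b3123
```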


Solution

The pagemap is only used for virtual-to-physical address translation. However, since it resides in memory and is only partially cached in the TLBs, you may have to access memory during the translation process itself.

The basic flow is as follows:

  1. The execution engine calculates the address (some of the arithmetic, such as scale and offset, may actually be done in the memory unit).
  2. Look up in the DTLB.
    2.a. If it misses, look up in the 2nd-level TLB.
    2.a.a. If that misses too, start a page walk.
    2.a.b. If it hits in the 2nd-level TLB, fill the translation into the DTLB and proceed with the new physical address.
    2.b. If it hits in the DTLB, proceed with the physical address.
  3. Look up the L1; if it misses, look up the L2; if that misses, look up the L3; if that misses too, send the request to the memory controller and wait for the DRAM access.
  4. When the data returns (from whichever level), fill it into the caches along the way (depending on the fill policy, cache inclusiveness, temporal hints from the instruction, memory region type, and probably other factors as well).
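The flow above can be sketched as a toy model. Every structure here (TLB levels as plain dicts, caches as sets of line numbers, an always-complete page table) is an illustrative assumption, not a real Intel design:

```python
# Toy walk through steps 1-4: TLB levels first, then the cache hierarchy.
def load(vaddr, dtlb, stlb, caches, page_table, page_size=4096):
    vpn, offset = divmod(vaddr, page_size)

    # Step 2: translation
    if vpn in dtlb:
        pfn = dtlb[vpn]
    elif vpn in stlb:                # step 2.a.b: STLB hit
        pfn = stlb[vpn]
        dtlb[vpn] = pfn              # fill the DTLB
    else:                            # step 2.a.a: page walk (may fault)
        pfn = page_table[vpn]
        stlb[vpn] = dtlb[vpn] = pfn

    paddr = pfn * page_size + offset
    line = paddr // 64               # 64-byte cache lines

    # Step 3: cache hierarchy, innermost first
    for name, cache in caches:       # e.g. [("L1D", ...), ("L2", ...), ("L3", ...)]
        if line in cache:
            source = name
            break
    else:
        source = "DRAM"              # send to the memory controller

    # Step 4: fill the line into every level on the way back
    for _, cache in caches:
        cache.add(line)
    return paddr, source
```

A first access to a page comes back from "DRAM" and warms every level; repeating the same access then hits in "L1D".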

If a pagewalk was required, stall the main request and issue physical loads to the pagemap (according to the architectural definition). On x86 this may involve CR3, PDPTR, PDP, PDE, PTE, etc., depending on the paging mode, page sizes, and so on. Note that under virtualization, each pagewalk level on the VM may require a full pagewalk on the host (so you effectively square the number of steps needed).
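For reference, this is how a page walk slices the virtual address in the common x86-64 4-level, 4 KiB-page mode: four 9-bit table indices plus a 12-bit page offset (other paging modes and page sizes split the bits differently):

```python
# Index extraction for 4-level x86-64 paging with 4 KiB pages.
# Each 9-bit field selects one of 512 entries in its table; the walk
# starts at the table CR3 points to and descends one level per field.
def walk_indices(vaddr):
    return {
        "pml4":   (vaddr >> 39) & 0x1FF,
        "pdpt":   (vaddr >> 30) & 0x1FF,
        "pd":     (vaddr >> 21) & 0x1FF,
        "pt":     (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }

print(walk_indices(0x00007F1234567ABC))
# pml4=0xFE, pdpt=0x48, pd=0x1A2, pt=0x167, offset=0xABC
```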

Note that the pagemap is basically a tree structure, where each access depends on the value of the previous one (and on part of the virtual address being translated). These accesses are therefore serially dependent, and only once the last one completes do you get the physical address and can return to step 3. All along, the line you want may be sitting in your L1 without your being able to know it (although, to be honest, if you needed a pagewalk you're not likely to still have the line in your upper-level caches).

Other important notes: the pagemap lives in physical address space and is accessed that way. You don't want to have to translate the accesses you need for translation; that could be a deadlock :)
More importantly, the pagemap data can itself be cached, so while a simple memory access may expand into multiple accesses due to a TLB miss, the pagewalk may still be fairly cheap.

OTHER TIPS

Yes, as explained in this long write-up, which illustrates the path from the CPU through the L1, L2, and L3 caches:

http://lwn.net/Articles/252125/


Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow