Question

In the Linux kernel, I wrote code that resembles copy_page_range (mm/memory.c) to copy memory from one process to another with the COW optimization. The destination and source addresses can be offset by PAGE_SIZE and COW still works. I noticed, however, that when a user program copies from the same source address to different destination addresses, the TLB does not seem to be flushed properly. At a high level, my user-level code does the following (I copy exactly one page, 0x1000 bytes on my machine, at a time):

SRC=0x20000000

  1. Write to SRC (call the associated page page1).
  2. Syscall to copy SRC into 0x30000000 in the destination process. Now, source address 0x20000000 and destination address 0x30000000 point to the same page (page1).
  3. Write something different to SRC (this should trigger a page fault to handle the COW). The source address should now point to a new page, page2.
  4. Syscall to copy SRC into 0x30001000 in destination process.

At this point, two separate pages should exist:

  • SRC 0x20000000 → page2
  • DST 0x30000000 → page1
  • DST 0x30001000 → page2

I find that at step 3, when I write something different into SRC 0x20000000, no page fault is generated. Upon inspection, the actual page mappings are:

  • SRC 0x20000000 → page1
  • DST 0x30000000 → page1
  • DST 0x30001000 → page1
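In code, the whole sequence looks roughly like the sketch below; copy_cow_page() is a hypothetical wrapper for my syscall (the real interface doesn't matter here), and the destination pid is a placeholder:

    #include <string.h>
    #include <sys/mman.h>

    #define PAGE 0x1000

    /* Hypothetical wrapper for the custom syscall: share the page at src
     * in this process into address dst of process dst_pid, COW-style. */
    extern int copy_cow_page(int dst_pid, void *src, void *dst);

    int main(void)
    {
        int dst_pid = 1234; /* placeholder: pid of the destination process */

        /* Map the fixed source page at 0x20000000. */
        char *src = mmap((void *)0x20000000, PAGE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (src == MAP_FAILED)
            return 1;

        strcpy(src, "first");                            /* 1. write page1 */
        copy_cow_page(dst_pid, src, (void *)0x30000000); /* 2. share page1 */
        strcpy(src, "second");                           /* 3. should COW-fault */
        copy_cow_page(dst_pid, src, (void *)0x30001000); /* 4. share page2 */
        return 0;
    }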

In my kernel code, if I call flush_tlb_page on the source address, the user code works as expected with the proper page mappings, so I am convinced I am not maintaining the TLB correctly. In copy_page_range, the kernel calls mmu_notifier_invalidate_range_start/end before and after it alters the page tables. I am doing exactly the same thing and have double-checked that I am passing the correct mm_struct and addresses to mmu_notifier_invalidate_range_start/end. Does this function not handle flushing the TLB?

OK, so literally as I finished typing this, I checked dup_mmap (kernel/fork.c) and realized that the primary caller of copy_page_range calls flush_tlb_mm. I am guessing I should call flush_cache_range before my kernel code and flush_tlb_range after it. Is this correct? And what exactly do mmu_notifier_invalidate_range_start/end do?


Solution

Yes, if you are doing something that changes page tables, you need to make sure that the TLB is invalidated as required.

mmu_notifier_invalidate_range_start/end just invoke MMU notifier hooks; these hooks exist only so that other kernel code can be told when TLB invalidation is about to happen (or has just happened). They do not flush the hardware TLB themselves. The only places that set up MMU notifiers are the following (a sketch of such a consumer appears after the list):

  • KVM (hardware-assisted virtualization) uses them to handle swapping out pages; it needs to know about host TLB invalidations to keep the virtualized guest MMU in sync with the host.
  • GRU (a driver for specialized hardware in large SGI systems) uses MMU notifiers to keep the mapping tables in the GRU hardware in sync with the CPU MMU.
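For illustration, a consumer of these hooks looks roughly like the sketch below. The callback signature follows the older (mm, start, end) style; it has changed across kernel versions, so check <linux/mmu_notifier.h> in your tree:

    #include <linux/mmu_notifier.h>

    /* Called before the host page tables for [start, end) change; tear
     * down or write-protect any secondary mappings (guest MMU, device
     * mapping tables, ...) that cover the range. */
    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
    {
        /* ... update the secondary MMU here ... */
    }

    static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
    };

    static struct mmu_notifier my_mn = { .ops = &my_mn_ops };

    /* Registered once against the mm being tracked, e.g.:
     *     mmu_notifier_register(&my_mn, current->mm);
     */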

But pretty much anywhere you call the MMU notifier hooks, you should also be calling the TLB shootdown functions, unless the kernel is already doing that for you.
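Concretely, the ordering you converged on is right: flush the cache before touching the page tables and shoot down the TLB afterwards. Here is a sketch of the kernel-side pattern, modeled on dup_mmap()/copy_page_range() (again using the older (mm, start, end) notifier signatures; newer kernels pass a struct mmu_notifier_range instead):

    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>
    #include <asm/cacheflush.h>
    #include <asm/tlbflush.h>

    static void cow_share_range(struct mm_struct *src_mm,
                                unsigned long src, unsigned long len)
    {
        struct vm_area_struct *vma = find_vma(src_mm, src);

        /* Write back dirty lines first (matters on aliasing caches). */
        flush_cache_range(vma, src, src + len);

        mmu_notifier_invalidate_range_start(src_mm, src, src + len);

        /* ... write-protect the source PTEs and install them in the
         * destination mm here ... */

        mmu_notifier_invalidate_range_end(src_mm, src, src + len);

        /* The missing step: the CPU TLB may still hold a stale writable
         * entry for the source pages, so shoot it down explicitly. */
        flush_tlb_range(vma, src, src + len);
    }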
