I think you misunderstood what exactly a pagefault is. A pagefault, according to Wikipedia, is a "trap" (exception), a kind of interrupt that the CPU itself generates when a program tries to access something that is not loaded into physical memory (but is usually already registered in virtual memory, with its page marked as "not present", i.e. P: Present bit = 0).
A pagefault is expensive because it forces the CPU to stop executing the user program and switch to the kernel. Pagefaults in kernel mode are less frequent, because the kernel can check whether a page is present before accessing it. If a kernel function wants to write something to a new page (in your case, the read syscall), it allocates the page by calling the page allocator explicitly, not by touching the page and taking a pagefault. Explicit memory management means fewer interrupts and less code to execute.
--- read case ---
Your read is handled by sys_read from fs/read_write.c. Here is the call chain (possibly not exact):
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    ret = vfs_read(f.file, buf, count, &pos);
vvv
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
    ret = file->f_op->read(file, buf, count, pos);
vvv
const struct file_operations ext4_file_operations = {
    .read = do_sync_read,
... do_sync_read -> generic_file_aio_read -> do_generic_file_read
static void do_generic_file_read(struct file *filp, loff_t *ppos,
    for (;;) {
        struct page *page;
        page = find_get_page(mapping, index);
        if (!page) {
            goto no_cached_page;
// osgx - case when pagecache is empty ^^vv
no_cached_page:
    /*
     * Ok, it wasn't cached, so we need to create a new
     * page..
     */
    page = page_cache_alloc_cold(mapping);
static inline struct page *page_cache_alloc_cold(struct address_space *x)
    return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
vvv
static inline struct page *__page_cache_alloc(gfp_t gfp)
    return alloc_pages(gfp, 0);
So I can trace that the read() syscall ends in page allocation (alloc_pages) via direct calls. After allocating the page, the read() syscall performs a DMA transfer of the data from the HDD into the new page, copies it into your buffer, and then returns to user space (this is the case when the file is not cached in the pagecache). If the data was already in the pagecache, read() (do_generic_file_read) reuses the existing page from the pagecache, without an actual HDD read, by copying the data from it into your buffer.
After read() returns, all the data is in memory, and read access to it will not generate a pagefault.
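To check the read() case from user space, something like the following could be used. It is only a sketch, not your original test: the helper names faults_so_far and faults_after_read are made up for illustration, and it assumes the fault counters reported by getrusage() are good enough for this purpose. The buffer is touched before read() so that its own first-touch faults are not counted.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Minor + major page faults charged to this process so far, or -1. */
long faults_so_far(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Hypothetical helper: read() up to `len` bytes of `path` into a buffer
 * that was touched beforehand, then walk the buffer one page at a time.
 * Returns the number of page faults taken during the walk, or -1 on error.
 */
long faults_after_read(const char *path, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned long sum = 0;  /* keeps the loads from being optimized away */
    long before, after;
    char *buf;
    ssize_t n;
    int fd;

    if (len == 0 || page <= 0)
        return -1;
    buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 0, len);             /* fault the buffer in before read() */

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    n = read(fd, buf, len);          /* the kernel copies file data into buf */
    close(fd);
    if (n < 0) {
        free(buf);
        return -1;
    }

    before = faults_so_far();
    for (ssize_t off = 0; off < n; off += page)
        sum += (unsigned char)buf[off];
    after = faults_so_far();

    free(buf);
    return after - before;
}
```

Calling faults_after_read("yourfile", size) should normally report no faults, or very few, for the walk, whether or not the file was in the pagecache before the call.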
--- mmap case ---
If you rewrite the test to mmap() your file and then access (text[offset]) a non-present page of the file (one that was not in the pagecache), a real pagefault will occur.
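A minimal sketch of such an mmap() variant is below. The helper name faults_after_mmap is hypothetical, the mapping is read-only and private by assumption, and the counters again come from getrusage(). Each page is touched once through text[off], so every counted fault is a real trap taken on first access.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: mmap() up to `len` bytes of `path` (the whole file
 * if `len` is 0 or larger than the file) and read one byte from every page.
 * Returns the number of page faults taken while touching the mapping,
 * or -1 on error.
 */
long faults_after_mmap(const char *path, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned long sum = 0;  /* keeps the loads from being optimized away */
    struct rusage before, after;
    struct stat st;
    char *text;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (page <= 0 || fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    if (len == 0 || (unsigned long long)st.st_size < len)
        len = (size_t)st.st_size;

    text = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close() */
    if (text == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap(text, len);
        return -1;
    }
    for (size_t off = 0; off < len; off += (size_t)page)
        sum += (unsigned char)text[off];   /* first touch of each page may trap */
    if (getrusage(RUSAGE_SELF, &after) != 0) {
        munmap(text, len);
        return -1;
    }

    munmap(text, len);
    return (after.ru_minflt + after.ru_majflt)
         - (before.ru_minflt + before.ru_majflt);
}
```

The result does not have to equal the number of pages touched, because the kernel may map several neighbouring pages while handling a single fault.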
All pagefault counters (perf stat and /proc/$pid/stat) are updated ONLY when a real pagefault trap is generated by the CPU. Here is the x86 pagefault handler from arch/x86/mm/fault.c, which will run:
dotraplinkage void __kprobes
do_page_fault(struct pt_regs *regs, unsigned long error_code)
    __do_page_fault(regs, error_code);
vvv
/*
 * This routine handles page faults. It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 */
__do_page_fault(struct pt_regs *regs, unsigned long error_code)
/// HERE is the perf stat pagefault event generator VVV
    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
Somewhere later, the pagefault handler calls handle_mm_fault -> handle_pte_fault -> __do_fault, ending in vma->vm_ops->fault(vma, &vmf);.
This fault virtual function was registered in mmap, and I think it is filemap_fault. This function does the actual page allocation (__alloc_page) and the disk read when the pagecache is empty (this is counted as a "major" pagefault, because it requires external I/O), or it maps the page from the pagecache (if the data was prefetched or is already in the pagecache; this is counted as a "minor" pagefault, because it is done without external I/O and is generally faster).
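To see the minor/major split for your own process, the two counters can be read from /proc/self/stat. The sketch below assumes the field layout documented in proc(5), where minflt is field 10 and majflt is field 12; the function name read_fault_counters is made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read the minor and major fault counters of the
 * calling process from /proc/self/stat (fields 10 and 12 in proc(5)).
 * Returns 0 on success, -1 on error.
 */
int read_fault_counters(unsigned long *minflt, unsigned long *majflt)
{
    char line[1024];
    char *p;
    FILE *f = fopen("/proc/self/stat", "r");

    if (f == NULL)
        return -1;
    if (fgets(line, sizeof line, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* The command name (field 2) may contain spaces, so skip to its last ')'. */
    p = strrchr(line, ')');
    if (p == NULL)
        return -1;

    /* After ')': state ppid pgrp session tty_nr tpgid flags minflt cminflt majflt */
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %lu %*lu %lu",
               minflt, majflt) != 2)
        return -1;
    return 0;
}
```

Calling it before and after the page accesses and subtracting gives the number of minor and major faults for that region of the test; after dropping the pagecache the majflt difference would be expected to grow for the mmap() variant.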
PS: Doing experiments on a virtual platform may change the results; for example, even after dropping the disk cache (pagecache) in the guest Fedora with echo 3 > /proc/sys/vm/drop_caches, data from the virtual hard drive can still be cached by the host OS.