Question

I am reading files of different sizes (1 KB to 1 GB) using read() in C. But every time I check the page faults using perf stat, it gives me (almost) the same values.

My machine: (fedora 18 on a Virtual Machine, RAM - 1GB, Disk space - 20 GB)

uname -a
Linux localhost.localdomain 3.10.13-101.fc18.x86_64 #1 SMP Fri Sep 27 20:22:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

mount | grep "^/dev"
/dev/mapper/fedora-root on / type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)

My code:

 10 #define BLOCK_SIZE 1024
. . . 
 19         char text[BLOCK_SIZE];
 21         int total_bytes_read=0;
. . .

 81         while((bytes_read=read(d_ifp,text,BLOCK_SIZE))>0)
 82         {
 83                 write(d_ofp, text, bytes_read); // writing to /dev/null
 84                 total_bytes_read+=bytes_read;
 85                 sum+=(int)text[0];  // doing this just to make sure there's 
                                             // no lazy page loading by read()
                                             // I don't care what is in `text[0]`
 86         }
 87         printf("total bytes read=%d\n", total_bytes_read);
 88         if(sum>0)
 89                 printf("\n");

Perf-stat output: (shows file size, time to read the file and the # of page faults)

[read]:   f_size:    1K B, Time:  0.000313 seconds, Page-faults: 150, Total bytes read: 980 
[read]:   f_size:   10K B, Time:  0.000434 seconds, Page-faults: 151, Total bytes read: 11172
[read]:   f_size:  100K B, Time:  0.000442 seconds, Page-faults: 150, Total bytes read: 103992
[read]:   f_size:    1M B, Time:  0.00191  seconds, Page-faults: 151, Total bytes read: 1040256
[read]:   f_size:   10M B, Time:  0.050214 seconds, Page-faults: 151, Total bytes read: 10402840 
[read]:   f_size:  100M B, Time:  0.2382   seconds, Page-faults: 150, Total bytes read: 104028372 
[read]:   f_size:    1G B, Time:  5.7085   seconds, Page-faults: 148, Total bytes read: 1144312092 

Questions:
1. How can the number of page faults be the same for a read() of a 1 KB file and of a 1 GB file? Since I also touch the data (code line #85), I am making sure the data is actually being read.
2. The only reason I can think of for seeing so few page faults is that the data is already present in main memory. If this is the case, how can I flush it so that my code shows the true page faults when I run it? Otherwise I can never measure the true performance of read().

Edit1:
echo 3 > /proc/sys/vm/drop_caches doesn't help; the output remains the same.

Edit2: For mmap, the output of perf-stat is:

[mmap]:   f_size:    1K B, Time:  0.000103 seconds, Page-faults: 14
[mmap]:   f_size:   10K B, Time:  0.001143 seconds, Page-faults: 151
[mmap]:   f_size:  100K B, Time:  0.002367 seconds, Page-faults: 174
[mmap]:   f_size:    1M B, Time:  0.007634 seconds, Page-faults: 401
[mmap]:   f_size:   10M B, Time:  0.06812  seconds, Page-faults: 2,688
[mmap]:   f_size:  100M B, Time:  0.60386  seconds, Page-faults: 25,545
[mmap]:   f_size:    1G B, Time:  4.9869   seconds, Page-faults: 279,519

Solution

I think you have misunderstood what a pagefault is. A pagefault, according to Wikipedia, is a "trap" (an exception, a kind of interrupt) that the CPU itself generates when a program tries to access something that is not loaded into physical memory (but is usually already registered in virtual memory, with its page marked as "not present", i.e. P: Present bit = 0).

A pagefault is expensive because it forces the CPU to stop executing the user program and switch to the kernel. Pagefaults in kernel mode are much less frequent, because the kernel can check whether a page is present before accessing it. If a kernel function wants to write something to a new page (in your case, the read syscall), it allocates the page by calling the page allocator explicitly, not by touching the address and taking a pagefault. With explicit memory management there are fewer interrupts and less code to execute.

--- read case ---

Your read is handled by sys_read from fs/read_write.c. Here is the call chain (possibly not exact):

472 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
479                 ret = vfs_read(f.file, buf, count, &pos);
  vvv
353 ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
368                         ret = file->f_op->read(file, buf, count, pos);
  vvv

fs/ext4/file.c

626 const struct file_operations ext4_file_operations = {
628         .read           = do_sync_read,

... do_sync_read -> generic_file_aio_read -> do_generic_file_read

mm/filemap.c

1100 static void do_generic_file_read(struct file *filp, loff_t *ppos,
1119         for (;;) {
1120                 struct page *page;
1127                 page = find_get_page(mapping, index);
1128                 if (!page) {
1134                                 goto no_cached_page;  
  // osgx - case when pagecache is empty  ^^vv
1287 no_cached_page:
1288                 /*
1289                  * Ok, it wasn't cached, so we need to create a new
1290                  * page..
1291                  */
1292                 page = page_cache_alloc_cold(mapping);

include/linux/pagemap.h

233 static inline struct page *page_cache_alloc_cold(struct address_space *x)
235         return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
  vvv
222 static inline struct page *__page_cache_alloc(gfp_t gfp)
224         return alloc_pages(gfp, 0);

So I can trace that the read() syscall ends in page allocation (alloc_pages) via direct calls. After allocating the page, read() will do a DMA transfer of the data from the HDD into the new page, copy the requested bytes into your buffer, and then return to the user (this is the case when the file is not cached in the pagecache). If the data was already in the pagecache, read() (do_generic_file_read) will reuse the existing page from the pagecache, without an actual HDD read, and simply copy from it into your buffer.

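After read() returns, the data is in your buffer in memory, and read access to it will not generate a pagefault. Your loop also reuses the same 1 KB buffer on every iteration, so the roughly 150 faults you see most likely come from process startup (loading the binary and its libraries) and do not depend on the file size.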
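To check this from inside the program rather than with perf stat, one option is to sample getrusage() right before and right after the read loop. The following is only a minimal sketch, not the code from the question: the function name read_file_count_faults is made up for illustration, and the buffer is touched with memset() before the first sample so that its stack page is already resident. getrusage() and its ru_minflt/ru_majflt fields are standard.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Total (minor + major) pagefaults taken by this process so far, or -1. */
static long faults_so_far(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Hypothetical helper: reads `path` in 1 KiB chunks into one fixed buffer.
 * Stores the number of bytes read in *total and returns the number of
 * pagefaults taken while the loop was running, or -1 on error.
 */
static long read_file_count_faults(const char *path, long long *total)
{
    char text[1024];
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Touch the buffer first so it is resident before we start counting. */
    memset(text, 0, sizeof text);
    *total = 0;

    long before = faults_so_far();
    while ((n = read(fd, text, sizeof text)) > 0)
        *total += n;
    long after = faults_so_far();

    close(fd);
    if (n < 0 || before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling read_file_count_faults() on a small file and on a large one should give a similarly small number in both cases, because the kernel fills the same already-resident buffer on every call.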

--- mmap case ---

If you rewrite the test to mmap() your file and then access (text[offset]) a non-present page of the file (one that was not in the pagecache), a real pagefault will occur.

All pagefault counters (perf stat and /proc/$pid/stat) are updated ONLY when real pagefault traps are generated by the CPU. Here is the x86 pagefault handler from arch/x86/mm/fault.c, which does the counting:

1224 dotraplinkage void __kprobes
1225 do_page_fault(struct pt_regs *regs, unsigned long error_code)
1230         __do_page_fault(regs, error_code);
  vvv
1001 /*
1002  * This routine handles page faults.  It determines the address,
1003  * and the problem, and then passes it off to one of the appropriate
1004  * routines.
1005  */
1007 __do_page_fault(struct pt_regs *regs, unsigned long error_code)
 /// HERE is the perf stat pagefault event generator VVV 
1101         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

Somewhere later, the pagefault handler calls handle_mm_fault -> handle_pte_fault -> __do_fault, ending in vma->vm_ops->fault(vma, &vmf).

This fault virtual function was registered at mmap time, and I think it is filemap_fault. If the pagecache is empty, this function does the actual page allocation (__alloc_page) and the disk read; this is counted as a "major" pagefault, because it requires external I/O. If the data was prefetched or is already in the pagecache, it only remaps the page from the pagecache; this is counted as a "minor" pagefault, because it is done without external I/O and is generally faster.
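The minor/major split can be observed from user space as well. The sketch below maps a file, touches one byte per page, and reports how the two getrusage() counters changed. It is an illustration under assumptions, not the test from the question: struct fault_delta and mmap_touch_count_faults are hypothetical names, and the exact counts depend on the kernel (newer kernels may map several neighbouring pages per fault) and on what is already cached.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type: pagefaults taken while walking the mapping. */
struct fault_delta {
    long minor; /* resolved from the pagecache, no disk I/O */
    long major; /* required disk I/O */
};

/*
 * Maps `path` read-only, reads the first byte of every page, adds those
 * bytes into *sum, and stores the fault counter deltas in *out.
 * Returns 0 on success, -1 on error.
 */
static int mmap_touch_count_faults(const char *path, unsigned long *sum,
                                   struct fault_delta *out)
{
    struct rusage before, after;
    struct stat st;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    *sum = 0;

    getrusage(RUSAGE_SELF, &before);
    for (size_t off = 0; off < len; off += page)
        *sum += p[off]; /* first touch of an unmapped page traps into the kernel */
    getrusage(RUSAGE_SELF, &after);

    munmap((void *)p, len);
    out->minor = after.ru_minflt - before.ru_minflt;
    out->major = after.ru_majflt - before.ru_majflt;
    return 0;
}
```

Running it twice on the same large file should mostly move faults from the major column to the minor column on the second run, since the pages are then already in the pagecache.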


PS: Running experiments on a virtual machine may change the picture; for example, even after cleaning the disk cache (pagecache) in the guest Fedora with echo 3 > /proc/sys/vm/drop_caches, data from the virtual hard drive can still be cached by the host OS.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow